The purpose of this conference was, first, to share the information
gathered by the Clearinghouse for Applied Performance Testing (CAPT),
such as information on performance testing that could be used in public
schools, and secondly, to discuss problems that must be solved, issues
that should be addressed, and additional research and development
needed in the area of Applied Performance Testing (APT). The
presentations by the Clearinghouse dealt with the state of the art of
APT, an overview of Clearinghouse activities, instructional materials
developed on APT, and guidelines for the evaluation of APT materials
and procedures. The invited address by Saul Lavisky was presented next.
This was followed by small group discussion reports on problems,
issues, and needed research and development in APT. In the next section
individual papers are presented discussing these problems and issues.
Appendices contain a list of participants in this 1975 conference,
handouts accompanying the invited address, and guidelines for the
evaluation of APT materials and procedures. (RC)

INTRODUCTION

James R. Sanders
Clearinghouse for Applied Performance Testing

The Clearinghouse for Applied Performance Testing (CAPT) and the
National Council of Measurement in Education (NCME) would like to
welcome you to the National Conference on the Future of Applied
Performance Testing. Basically, we see this as a working conference
rather than a didactic session. The purpose of the conference is
twofold. First, we wish to share with you information that the
Clearinghouse has gathered over the last nine months, such as
information on performance testing that could be used in public
schools. Second, we wish to provide an opportunity for discussion of
problems that must be solved, issues that should be addressed, and
additional research and development needed in the area of applied
performance testing. We look to you, the audience, for direction to
guide that discussion.

Everyone at the conference is involved in applied performance testing
at some level. In essence, we have most of the experienced and
knowledgeable persons in the field gathered in this room. This has
significant implications in terms of what can be achieved. This
conference does not represent an isolated effort. We hope to lay out
specific goals for ourselves--and for others in this field--and then
reconvene next year to see how well we have achieved those goals and to
plan the next steps.

This conference is to be structured in the following way. First, we
would like to describe the operation of the Clearinghouse and to share
some of the information we have collected. The best way to do that is
to ask members of the Clearinghouse Policy Board to talk with you
briefly about some activities for which they have taken
responsibility. Second, we have invited Mr. Saul Lavisky from HumRRO to
address the group this afternoon. HumRRO is one of the oldtimers in
terms of applied performance testing and training. Because of his
long-term experience, Mr. Lavisky can present a perspective that many
of us have not had the opportunity to develop, and we appreciate his
willingness to share that with us.

Following Mr. Lavisky's remarks, we will break into smaller groups to
address discussion questions the conference staff have laid out. Groups
will be formed on the basis of professional role, with administrators
in one group, curriculum specialists in another, and measurement and
evaluation specialists in a third.

The small group sessions will be task oriented. We have asked each
group to address three questions from their particular perspectives as
representatives of a specific profession. The first is, What problems
are involved in the development or use of applied performance testing
in public schools? Second, What issues arise when applied performance
tests are considered for use in public schools? Finally, What research
and development efforts are needed in the area of applied performance
testing? We have asked one participant from each group to serve as a
recorder and provide the larger group a summary of the small group's
discussion of each question. We will reconvene later this afternoon to
hear those reports.

For the evening session we have asked four discussants, all people who
are extensively involved in applied performance testing, to share with
us their thoughts about what direction applied performance testing is
now taking. The four discussants are Joseph Boyd from the Educational
Testing Service; Hulda Grobman from the University of Illinois, College
of Medicine; William Osborn, another HumRRO representative; and Ruth
Nickse from Syracuse University Research Corporation. Each of these
four people has agreed to use his or her expertise in helping us
effectively address the topic to be covered during the course of this
conference.

Members of the Clearinghouse Policy Board have agreed to summarize for
you the work that the Clearinghouse for Applied Performance Testing has
done this past year. Let me provide a brief background for their
remarks. The Clearinghouse was established in July, 1974, through a
grant from Title V, Section 505 of ESEA to four participating states:
Hawaii, Oregon, Pennsylvania and Washington. Subsequently, two other
projects were added to the Clearinghouse effort: one by the U.S. Office
of Education, Office of Planning, Budgeting and Evaluation, to collect
and evaluate measures of functional adult literacy; and one, initiated
through the Department of Defense, to search for occupational
certification measures.

The Clearinghouse Board has asked the Northwest Regional Educational
Laboratory to oversee day-to-day Clearinghouse operations, and we have
a Clearinghouse staff at the Laboratory; Mr. Thomas Sachse is here
representing that staff.

The Policy Board members are: Dr. Janet I. Sumida, Director of
Statewide Assessment at the Hawaii Department of Education and Project
Administrator for the Clearinghouse project; Dr. James Impara, Director
of Statewide Assessment at the Oregon Department of Education; Mrs.
Pauline Leet, Director of the Bureau of Curriculum Services at the
Pennsylvania Department of Education; and Gordon B. Ensign, Jr.,
Supervisor of Program Evaluation with the Washington Superintendent of
Public Instruction's office. Since each Policy Board member has
undertaken specific tasks on applied performance testing, I will let
them now describe for you what they have been doing.

APPLIED PERFORMANCE TESTING--THE STATE OF THE ART

James C. Impara 
Oregon State Department of Education 



Applied performance testing is not a new concept. In fact, applied
performance is probably one of the oldest forms of testing known.
However, it is difficult to find measures which have been
"standardized" so that the administration, scoring and interpretation
are reliable. Exceptions to this occur in a number of military settings
and in some vocational settings, but it is not the case that measures
are available for the more "mundane" activities (performances) each of
us encounters on a regular basis.

In an attempt to learn what currently exists in the field of applied
performance testing, a literature search for tests or informal papers
was conducted. In addition to the literature search, a survey was
conducted to obtain materials relevant to applied performance testing.

The literature search focused mainly on publications and projects
developed during the last five years. It was learned, however, that
military sources of information required more extensive research since
performance testing had been employed by the military since World War
II. Searches of computer information bases were conducted to reveal
additional sources of information.

Results of the searches varied widely (even within the same system)
depending on the search strategy and descriptors used. This variability
stemmed from the fact that descriptors used within the systems did not
correspond to current notions of performance. This proved a complex
problem. Not only did descriptors fail to match our descriptions
(making access difficult) but descriptors assigned when documents were
entered into the system were couched in a dated vocabulary.

To aid in the search for materials, a subcontract was given to Adrian
Van Mondfrans of Brigham Young University. The BYU staff completed a
literature survey as well as a field survey of applied performance
assessment activity. The literature survey netted 350 annotated
references. Many of these references duplicated the present
Clearinghouse materials; however, there were enough new references to
convince the Clearinghouse staff that the external search activity had
been beneficial to the project.

The field survey employed a questionnaire sent to 600 individuals
throughout the country. In this survey special emphasis was placed on
determining the need for and availability of instructional materials
and measures for applied performance assessment. Unfortunately, the
return rate of this questionnaire was quite low--perhaps because it had
been sent just prior to Christmas, 1974. However, a follow-up study
retrieved some additional data.

By winter of 1975, many projects had been identified in the field and
approximately 30 major centers of activity were noted. In an attempt to
collect current information about new developments in the field, the
Policy Board and staff visited these projects. Although some projects
were not as deeply involved in testing as originally believed, these
site visits proved beneficial since most projects had materials and
references that were of great utility to the Clearinghouse and
Clearinghouse users.

The data collection activities described above portray somewhat the
state of the art in applied performance testing. As might be expected,
applied performance testing is well developed in subject matter areas
in which the product of the education requires the ability to perform.
Occupational fields, such as carpentry, mechanics, clerical skills and
masonry, rely on both performance tests and complementary paper and
pencil tests to certify occupational competency. Professional
occupations--especially the medical arts and teaching--have been very
active in using performance training and testing. The military and
private industry have also used performance testing extensively.

Simulation is a well-developed facet of applied performance testing.
Business and the medical arts are proficient users of simulation and
gaming techniques. Simulation has some distinct advantages over
performance testing, including reduced cost, increased sampling of
behavior, and the possibility for variation while maintaining
standardization.

Although traditional public school content areas often lack applied
performance testing devices, increased interest and development in
basic skills assessment will soon change this. A growing desire for
assessment of school subject matter in terms of life skills will
require additional measures of an applied performance nature. As a
result of Clearinghouse activities, some technical instruments for
measurement of public school content areas are becoming available.

The unwillingness of some developers to share their ideas and products
has been a major problem for the Clearinghouse. This unwillingness
seems to stem from two different points of view. The first is that the
producer of the measures does not feel that the measures are ready to
be released for wide-scale use before further development. This problem
might be classified as the avoidance of potential embarrassment because
of known or expected flaws in the development of the measure. Whenever
the Clearinghouse is aware of such a circumstance, it has offered to
guarantee the author's anonymity as well as provide feedback on the
materials to the author if desired; the Clearinghouse is very concerned
that the developers of new, incomplete materials be protected from
potential embarrassment because of uncorrected flaws or errors. The
second point of view stems from the unwillingness of certain groups to
participate because they have expended large sums of money or large
amounts of time (or both) and do not wish their materials to be
distributed to those who have not participated in the development.

It is hoped that as the Clearinghouse grows, both with respect to the
collection of material and with respect to establishing trust and
credibility, the reticence shown by some who are unwilling to share
will be relieved and the state of the art can grow even further.

AN OVERVIEW OF CAPT ACTIVITIES

Thomas P. Sachse
Clearinghouse for Applied Performance Testing

The Clearinghouse for Applied Performance Testing (CAPT) was
cooperatively conceived, proposed and undertaken by representatives of
NWREL and the four member states. This group, collectively designated
the CAPT Policy Board, directs Clearinghouse operations. It should be
noted that the Policy Board has done an excellent job of delineating
the tasks necessary to operationalize a clearinghouse of this type. Let
me now describe for you the activities that have shaped the present
status of CAPT.

Throughout the short life of the project, the CAPT proposal has
provided clear directions for developing a functional clearinghouse. In
writing the proposal, the Policy Board took care to provide means for
accomplishing six main objectives:

1. Collection of applied performance testing materials. 

2. Formation of a consumer audience interested or involved 
in applied performance testing. 

3. Dissemination of materials of vital concern to potential 
users. 

4. Development of instructional materials on applied
performance testing.

5. Development of criteria for evaluating applied performance
instruments.

6. Evaluation of the CAPT project.

Members of the Policy Board are here today to discuss their roles in
the completion of these objectives. My overview of the CAPT project is
intended to complement their remarks.



Dr. Impara has already described the results of important collection
activities undertaken this past year. I will now delineate the tasks
that led to those results. Let me first mention that many CAPT
activities were conducted for multiple purposes. For example, one of
the first activities was to publicly announce the establishment of CAPT
and to solicit information about persons or projects in the field. This
activity provided CAPT its first materials and began the formation of
our consumer audience. These releases were sent to a variety of
educational journals and other informational publications.

Numerous letters soliciting applied performance testing materials were
sent to workers in the field. Their responses provided additional
materials to CAPT and further expanded the consumer audience. Many
researchers in the field were identified through another collection
effort--namely, literature searches. During the past year, CAPT has
conducted a literature search at the NWREL Information Center, four
different computer information base searches and, as Dr. Impara noted,
has contracted with BYU to conduct an independent literature search and
field survey. Surprisingly, no search entirely duplicated previous
efforts. As a result of these searches, additional applied performance
testing materials were contributed to CAPT, and additional workers in
the field were identified.

Collecting materials and forming a CAPT consumer audience are ongoing
activities. CAPT receives, daily, requests from persons wishing to be
put on the mailing list, and regular contributions of testing
materials.

By November and December of 1974, major centers of activity had been
identified, and plans were made to visit these projects to collect
information and materials. CAPT has received continued interest and
support from those project personnel; more than half of our conference
discussants and reporters are from agencies visited by CAPT.

This National Conference is primarily viewed as a planning and
dissemination activity. It is expected--and hoped--that our discussions
will net new sources of information for CAPT.

The formation of a consumer audience has been closely tied with
collection efforts. As important projects or products in the field are
identified, CAPT contacts individuals responsible for development, as
well as other interested persons, so that through information sharing
all can benefit from others' endeavors. Persons who contributed
materials to CAPT are offered, in exchange, an equivalent number of
duplicated materials at no cost. Our concern in making the
Clearinghouse effective in dissemination as well as collection has
resulted in an active consumer audience.

A number of important tasks must be completed prior to dissemination.
When materials are first received, they are screened to determine their
appropriateness for inclusion in our collection. Because applied
performance testing is a large umbrella under which many testing
devices and materials fall, this task may seem unnecessary. On the
contrary, however, a surprising number of contributions bear little
relation to applied performance testing. They are occasionally
included, however, because even materials that seem only tangential to
the field are often requested by our consumer audience.

CAPT was formed to collect testing materials for use at the public
school level; however, most contributions fall outside this domain. The
relative newness of the field and the use of performance settings
(occupational and adult education) are two factors that encourage
development of applied performance testing in non-traditional public
school subject matter areas. Recent demand for applied performance
tests in public schools will shift the developmental emphasis, and CAPT
will be ready to provide assistance where necessary.

Once materials have been screened, they are referenced and catalogued
for user access. Subject matter, target population, availability, grade
level and testing mode are but a few of the variables by which the
materials are classified.

Availability is an area of particular concern to CAPT. For many
reasons, some of the finest materials collected are unavailable to CAPT
in quantities adequate for dissemination. If such materials are
available from a specified source, CAPT tries to provide users ordering
information.

Unfortunately, public school educators are not funded to obtain
commercially available materials. Many excellent products--capable of
providing much valuable information--are still in developmental stages.
The interest in applied performance testing has now developed far
beyond the field's technical or financial capability to respond.

CAPT is presently annotating screened and catalogued materials to
provide users the kind of summary information they need in requesting
CAPT materials. Having to select materials strictly on the basis of
title, author, institution, date, and number of pages is simply
unsatisfactory.

Dissemination of collected materials and information on applied
performance testing has been and remains the ultimate goal of CAPT.
CAPT was established to provide materials to those with an expressed
need; filling expressed and perceived needs is a constantly expanding
activity. During the first months of CAPT, the emphasis was on
collection; now we must "deliver the goods" to interested parties.
Although collection and dissemination are complementary, ongoing
activities, shifts in emphasis do occur over time.

At the inception of the CAPT project, an informational brochure was
distributed to (1) announce the establishment of CAPT and (2) solicit
applied performance materials. That brochure was also designed to
introduce people to a concept in measurement that they may have
overlooked in relation to public school testing.

The CAPT Newsletter, published bimonthly, has been useful in keeping
readers informed of new developments in the field, CAPT activities, and
projects and publications relating to applied performance testing. The
January CAPT Newsletter included a list of References Related to
Applied Performance Testing. This document provided readers an initial
look at the then current status of publications in the field. Ordering
information was included. The annotated bibliography of CAPT
resources--an updated version of that list of references--will be
released in May, as will The Synthesis Survey of Applied Performance
Materials, a state-of-the-art document of all references encountered by
CAPT.

Dissemination of CAPT materials made available through the "References
Related" document has been constant since the January Newsletter was
issued. In addition, CAPT has responded to inquiries for help in
applied performance testing and to requests for assistance in statewide
assessment. Individual requests for CAPT assistance are handled by the
CAPT staff or by member state representatives to CAPT. These requests
vary greatly and many specialized needs cannot be met. In the event
that a testing device is not currently available for the subject matter
or audience, materials or devices that can be adapted are recommended.
In the event that CAPT cannot meet an individual's need, the request is
filed and reactivated when new relevant materials are collected.

CAPT has also provided support to various statewide assessment
activities. The State of Hawaii is currently pilot-testing applied
performance exercises for use in statewide assessment. Oregon and
Pennsylvania have indicated a need for applied performance measures of
citizenship, and CAPT is assisting in the development of testing
materials for this important subject matter. CAPT has provided applied
performance materials to non-member states as well.

Through a variety of approaches, CAPT has attempted to acquire a
national scope. For example, publicity releases were issued to national
publications and regional publications outside of member states. CAPT
has received contributions and requests to be put on the mailing list
from persons in almost every state. The CAPT audience is, however,
concentrated in member states, because of extensive publicity provided
through member state publications, the influence of member state
representatives, and dissemination policies for member states--CAPT
materials are free to agencies within member states.

The reasons for seeking a national scope are (1) to increase interest
in and contributions to CAPT holdings, (2) to decrease duplication of
development efforts, and (3) to determine appropriate future activities
for CAPT and those interested in promoting the field of applied
performance testing.

This National Conference represents a culmination of efforts to achieve
national scope. Because the National Council of Measurement in
Education (NCME) is co-sponsoring the Conference, announcements of the
Conference were sent to the entire NCME mailing list.

CAPT is responsive to problems in the field of educational measurement
and has attempted, to the extent that available resources permit, to
relate applied performance testing to larger educational goals. This
past spring, CAPT invited three well-known measurement specialists to
discuss different approaches to measuring student competencies. Dr.
Robert Ebel, of Michigan State University, discussed traditional
norm-referenced testing; Dr. W. James Popham, of the University of
California at Los Angeles, dealt with domain and criterion-referenced
measurement; and Dr. William McClelland, of the Human Resources
Research Organization, reported on applied performance testing.

No consensus concerning the most effective approach to competency-based
testing was reached. The group noted a need for a more adequate
definition of the term "competency-based measurement," and proposed
specific approaches to competency-based assessment in Oregon. The
comments were made in the context of legislative mandates for
competency-based measurement in Oregon.

CAPT has been represented in the NCME Task Force on Competency
Measurement. This Task Force has been asked to identify major issues
concerning competency-based measurement, and to suggest strategies and
directions for future research.

The Clearinghouse Policy Board is concerned with advancing the field of
applied performance testing, and to this end has begun developing
various facets of the field. Each state representative has taken
responsibility for one or more CAPT activities in addition to Policy
Board direction, site visitations, and state responsibilities.

DEVELOPING INSERVICE EDUCATIONAL MATERIALS
FOR APPLIED PERFORMANCE TESTING

William Gauthier, Jr., Bucknell University,
Pauline Leet, Pennsylvania Department of Education and
Hugh F. McKeegan, Bucknell University



Introduction 

One objective of the Clearinghouse for Applied Performance Testing is
the production and evaluation of instructional materials on "the
definition, attributes, development and use of applied performance
procedures and materials for student assessment." Discussion of this
work will complement the literature search and field survey conducted
by Richard Kay and others (1975), and add to the resources already
available from NWREL, ERIC, HumRRO, and other relevant information
sources. The primary emphasis of this work is on the "production of new
instructional materials relevant to consumer needs."

Rationale

Many important outcomes of elementary and secondary education are
defined through an "if x then y" kind of relationship. In other words,
if a student masters skill "x," he has mastered, or probably will
master, skill "y." Certain "x" skills--such as reading, writing,
computing and speaking--can be assessed directly; other kinds of
cognitive competencies, such as the degree of mastery of a secondary
level course in history or literature, are usually evaluated by
sampling the behaviors which the course is purported to develop.
Whether one uses a criterion or norm-referenced approach, "x" type
learnings can be assessed rather reliably for the purposes of the
school using a variety of direct and indirect measurement techniques.
Traditionally the "y" kinds of outcomes (e.g., good citizenship, work
habits, social responsibility) are believed to arise from command of
subject matter skills and immersion in the micro-society represented in
the school.

Applied performance testing requires educators to re-examine the extent
to which their assessment of type "x" outcomes is reliable and valid
when it occurs in a context, either real or simulated, representative
of the macro-society. APT also demands a searching analysis of "y" type
outcomes to determine (a) the extent to which they can be
operationalized and assessed directly, and (b) the degree to which it
can be logically inferred that mastery of "x" type outcomes will
provide a basis for appropriate behavior in "y" type situations or
conditions. Despite our best efforts in these analyses, there will
always be relative uncertainty about the behavior individual graduates
will exhibit in complex real-life situations. Personality factors,
attitudes, the nature of the problem, the situational context in which
the problem is presented and the degree of originality in generating
problem-solving strategies are but a few of the factors that affect
real-life performance. Further, to paraphrase Margaret Mead, "We must
often teach for what we don't know yet"; for a future that is
undefined, schools must stress analytical and problem-solving skills
and procedures that will have general applicability. Nevertheless,
education depends extensively on reducing uncertainty in behavior, and
APT can contribute significantly to this effort by expanding and
improving the assessment devices and procedures used in schools.

As this discussion indicates, the development of teacher inservice
materials for APT involves curricular as well as measurement and
evaluation considerations. The survey conducted by Kay et al., together
with a variety of observations made by preservice and inservice
teachers, suggests the extent of need in the measurement and evaluation
area: it can only be described as enormous. Only a small proportion of
teachers appear to have adequate command of measurement theory,
classical test concepts, or newer criterion-based approaches. The
inservice materials to be developed for APT will assume a basic
knowledge of elementary concepts in traditional tests and measurement,
but most schools contemplating the use of APT will need to structure
their inservice activities so that teacher competence in these
prerequisite areas is assured. To this end, the materials will include
references to selected materials and programs already available in the
general area of testing and evaluation, with particular emphasis on
criterion-referenced measurement. Either voluntary or mandated use of
APT will require individual teachers' participation in analysis and
re-analysis of curriculum priorities, as well as the appropriate use of
a variety of applied performance tests, indices, and observations. The
inservice materials will be designed to contribute to effective
participation in curriculum decisions impinging on APT and to the
effective and appropriate use of APT procedures.

Improving the competence of individual teachers--either preservice or
inservice--while certainly necessary, will not ensure that applied
performance techniques are appropriately used in schools. In
summarizing the research on educational innovations, Spady concludes
that "the failure of many if not most innovations lies in the failure
of schools to implement them adequately." Other observers, particularly
those involved in current developments in Oregon, emphasize the
enormity of the institutional change involved in developing APT
programs. Developing either teacher competence alone or administrator
competence alone can only lead to a great deal of personal frustration
and institutional fragmentation in the implementation of any sizable
innovation. Thus, it would appear that there should be a sequenced
development of the competencies of decision makers at both the
classroom and district levels if APT is to be more than another
innovative fad. The nature of the decisions to be made and the kinds of
information to be collected and processed are quite different at the
classroom and institutional levels. Administrators must concern
themselves with such topics as determining needs, conducting
discrepancy analysis, establishing priorities, securing staff
commitment, developing goals and objectives, and implementing,
evaluating and refining pilot programs. And while they must have a
cognitive understanding of APT concepts and procedures similar to that
of the teacher, administrators must also attend to all procedures and
constraints involved in bringing about viable and defensible
institutional change. To meet these diverse needs, the inservice
materials being developed are two distinct but coordinated units. The
first would focus on the informational needs of the classroom teacher
in implementing the specifics of an APT program. The second would
attempt to meet the informational needs of department chairmen,
coordinators, principals, and superintendents who must establish
institutional parameters and priorities for APT.

Constraints

The major constraints involved in developing the inservice materials
center on (a) the state of the art in applied performance testing,
(b) the availability of appropriate examples of applied performance
testing, and (c) the time frame in which materials must be completed.
Quite sophisticated applied performance strategies have been
developed--particularly by HumRRO--for use in military training
contexts, and tests of an applied performance type have been developed
by certain government agencies, industries and vocational education
institutions. Applicability of these materials to the kinds of tasks in
which public schools desire applied performance assessment may be rather
limited. While some elementary and secondary schools have developed
applied performance tests or use applied performance procedures,
examples of these kinds of approaches are not readily available to
schools--especially teacher training institutions. The work of the
Clearinghouse for Applied Performance Testing in collecting fugitive
materials and in encouraging their further refinement and
standardization should do much to alleviate this problem. Contractual
deadlines and requirements of the funding agency are such, however, that
the collection of materials and the development of inservice materials
described here must be completed by the end of this fiscal year. The
nature of the task has been defined, literature searches have been
conducted, and preliminary outlines have been prepared. Nevertheless,
the products, while they should prove useful in the preservice and
inservice education of teachers and administrators, must also be
considered as curriculum materials subject to formative evaluation and
further revision.

Description

The inservice materials will comprise two units, each consisting of a
discussion guide and references. One unit will be designed for the
preservice and inservice education of teachers and will include
information on the definitions of APT, appropriate and inappropriate
curricular uses of APT, constraints in use, and procedures for
development and evaluation of applied performance tests. The second unit
will focus on administrative and institutional concerns in implementing
APT programs and will include components on needs analysis, systems
development, pilot testing of APT-based curricular and instructional
systems, and formative and summative evaluations of APT programs.

Each unit will be tested in the developmental stage with small samples
of preservice and inservice teachers and administrators, and their
responses used as a guide to revision and improvement.

Persons who can offer suggestions regarding the development of the
inservice materials or references to extant materials relating to the
project are encouraged to contact developers c/o The Department of
Education, Bucknell University, Lewisburg, Pa. 17837.

GUIDELINES FOR EVALUATION OF APPLIED PERFORMANCE
TEST MATERIALS AND PROCEDURES¹

Janet I. Sumida 
Hawaii State Department of Education 



Need for the Guidelines 

As the Clearinghouse for Applied Performance Testing (CAPT) collects,
processes, and disseminates applied performance test materials,
some preliminary screening of the materials is necessary to ensure qual- 
ity control. Guidelines for systematic evaluation and screening of the 
materials must be established and publicized so that users of the test 
materials can be selective. 

In developing new test materials, CAPT must also be guided by cri- 
teria for evaluating the adequacy of the materials. Other test develop- 
ers may also find it useful to have a set of established, accepted guide- 
lines to which they can conveniently refer. 

Purpose of the Guidelines 

The guidelines are proposed primarily for use in evaluating the
adequacy of applied performance tests. However, they may also be of
help to test developers who must also be aware of the criteria for
determining the adequacy of applied performance tests. The guidelines
do not provide specific procedures for developing applied performance
tests; they are, according to Osborn, not "how-to-do-it"² guidelines for
test developers, but rather a set of criteria for assessing whether they
"have done it."³ They are intended essentially as a means of ensuring
quality control of applied performance test materials and procedures.

¹ The complete set of guidelines is provided in Appendix C of this
document.

² Quoted phrase from William Osborn's paper on review of the first draft
of the proposed guidelines, March 1975.

Ongoing Review and Updating of the Guidelines

At the K-12 school level, developmental work in the area of applied
performance testing is relatively new. Technical guidelines for
development and evaluation of applied performance test materials and
procedures have not been formally developed, studied, or written about
as extensively as those for other areas of development and measurement.
Those guidelines that have been compiled so far have been (a) "borrowed"
wherever appropriate from literature pertaining to traditional testing,
or (b) newly developed, based on current CAPT staff experiences in the
area of applied performance testing. In view of the way in which
guidelines were compiled, it was necessary that they be initially
reviewed by test and measurement experts.

The following criteria were proposed to the initial reviewers in 
their consideration of the newly compiled guidelines: 

1. Communicability of guideline statements. Is there a need for
additional details and further clarity?

2. Technical soundness. Are the guidelines credible, based on
experience and available information?

3. Usefulness. Is the guidelines' applicability potentially
broad in scope?

4. Relevance. Do the guidelines serve to fill a critical gap?

5. Updatedness. Are the guidelines consistent with current
developments in the area of applied performance testing?



³ Quoted phrase from William Osborn's paper on review of the first draft
of the proposed guidelines, March 1975.

The guidelines are subject to refinement and updating; additional
review and input will be solicited as they are more widely disseminated
for trial use. CAPT personnel would appreciate some discussion on the
guidelines during today's small group sessions. Input from the initial
group of reviewers acknowledged on your copy of the guidelines has been
most helpful.

How to Use the Guidelines

Although "applied performance testing" has been defined for CAPT
purposes, identifying the tasks of different age groups--such as fourth
graders, eighth graders, and eleventh graders--as "applied performance"
can be a problem. We usually associate vocational or on-the-job
competencies with adults, but it appears necessary to view school age
youngsters' competencies differently. Do we view students' competencies
as prerequisites to on-the-job or out-in-the-world adult survival
competencies? Perhaps schools can only provide indirect, inferential
evidence that pupils are likely to behave competently because they
possess the essential prerequisites to out-in-the-world and on-the-job
competencies.

When not equated with vocational competencies, measurement of
students' competencies must be based on test items that differ largely
from those used in directly measuring vocational competencies. For
example, we would not subject a fourth grader to a test of special
stenographic skills but more appropriately to what may constitute a set
of prerequisite stenographic skills such as ability to organize, to
carry on a telephone conversation, to alphabetize, or to greet visitors.
Such prerequisites to out-in-the-world or on-the-job competencies tend
to be general and applicable to many vocational areas.

We may want to further break down applied performance according to
the way Ebel proposes to identify competencies: (1) cognitive,
(2) physical, and (3) personal. According to Ebel, cognitive competency
results from the assimilation of useful information to form a structure
of knowledge and understanding; physical competency is a result of
natural endowments developed by practice; and personal competency is a
result of experience, imitation and adaptive behavior modification. Such
differentiated competencies could represent the kinds of prerequisite
competencies we speak of in relation to applied performance testing for
K-12 students.

In developing Hawaii's first statewide basic skills assessment 
package in the area of reading, we have attempted to identify those 
reading competencies that facilitate further learning and communication 
for the student. We have identified performance indicators to include
the following:

1. Understands meaning of words, word phrases, and word
relationships.

2. Demonstrates a positive attitude toward reading; reads a
variety of materials (including narrative, graphs, tables
and charts) for various purposes.

3. Locates and uses reading sources effectively.

4. Follows written directions.

5. Gets the main idea and supportive details from a reading 
selection. 

6. Reads critically. 

Therefore, we have put together a test package that appears to differ
from a traditional reading test. Hawaii's reading test package leans
toward applied performance testing. Because the reading test package
makeup is somewhat unusual, it has even been recently suggested by
certain local developers of reading materials that we identify our
reading assessment package by the set of performance indicators rather
than identify it strictly as a traditional reading test. We would like
to consider our reading assessment package as including a large part of
what is traditionally covered in a reading test as well as additional
applied performance material.

Instruments for the measurement of students' prerequisite
competencies will have to be evaluated for adequacy according to
criteria that may be represented by some traditional testing guidelines.
Instruments for the measurement of occupational competencies, on the
other hand, will have to be evaluated according to less traditional
guidelines.

The present proposed set of guidelines therefore consists of criteria
that may be used for testing of (1) general, prerequisite competencies,
and (2) occupational competencies. When students' general prerequisite
competencies are to be included as applied performance, we would have to
accept applied performance testing as including a wide range of
situations. Guidelines would then have to be viewed by users as
applicable to many different test materials and procedures. Users must
be selective in the application of guidelines for evaluation of unique
instruments.

It has been necessary to discuss the nature of applied performance
and related test content to arrive at a common understanding about the
basis for the selection of guidelines presently proposed for your
review.

We have made a beginning in the search and development of guidelines.
Your continued involvement and contributions are most appreciated.

INVITED ADDRESS

Saul Lavisky 
Human Resources Research Organization 






I am pleased to be here with you today, and flattered to have been
invited. I am not being immodest when I alert you--beforehand--to the
fact that I am not here because of any special personal expertise in the
area of performance testing; I am here, rather, as the representative of
an applied behavioral-science research-and-development organization
which has--over the past 23+ years--developed, used, depended upon, and
expanded both the theory and practice of performance testing.

The organization I represent is HumRRO--the Human Resources Research
Organization. Before I get into the "meat" of my presentation, I
want to say a few words about HumRRO, because I believe it will help you
put my comments into perspective if you know something about my
organization.

HumRRO was created in 1951 as an office of The George Washington
University. Our initial mission--and our sole mission until 1967--was
to conduct "human factors" research for the Department of the Army.
After a few years of "covering the field," we narrowed our focus to the
area of training and education because we found that this was where we
could have the most immediate and most substantial impact on improving
Army operations. Every officer and every enlisted member of the Army
spends some time in training and/or education. In fact, when the Army
is not fighting, it is training.

By the late 1950's, it was quite clear that many of the advances
HumRRO was making in the psychotechnology of training and education had
relevance for civilian trainers and educators, too. (I say
psychotechnology, because psychology is the basic discipline of
most--but not all--members of the HumRRO professional staff.)

In 1963, I joined HumRRO in an "interpretive" role. As a long- 
time Army Reservist, I was familiar with the context in which HumRRO was 
conducting its research-and-development activities. And, with some ex- 
perience in journalism, in the public schools in South Carolina, and in 
the National Education Association headquarters, it was presumed that I 
would be able to help "translate" HumRRO's work for the Army for the
benefit of civilian trainers and educators.

In 1967, the HumRRO "charter" was modified to allow us to work for 
sponsors other than, and in addition to, the Army. And, in 1969, we 
separated from The George Washington University and became an indepen- 
dent, nonprofit R&D organization. We are headquartered in Alexandria, 
Virginia and we work for a variety of sponsors, military and civilian 
alike. 

I am here today less as an "interpreter" than as a "reporter." I
want to tell you something about our experiences with performance test- 
ing, something about what we've learned over the past 23 years, and then 
I want to make some extrapolations from the training setting in which 
we've done most of our work to the education setting in which, I know, 
you conferees are primarily interested. 

The distinction between training and education is very important, 
in my opinion. I want to make it now, and I will come back to it later. 

Dr. Robert Glaser, in his book Training Research and Education,
reminds us that the basic concern of both training and education is the
modification and development of student behavior, and that both can be
defined as components of "the instructional process." He suggests that
the training component refers to teaching students to perform similar or
uniform behaviors. However, students display individual differences,
and it is also the responsibility of instructional systems to guide the
student's behavior in accordance with individual talents--in a sense, to
maximize the individual differences. He refers to this activity as the
educational component.

Dr. Meredith Crawford, in his chapter in the book Psychological
Principles in System Development, agrees that both training and
education are concerned with human learning, and that they both share
common technical problems of content and method. He makes the
distinction in terms of purpose. Dr. Crawford says that training is
undertaken to serve the needs of a particular system while education
aims to fit persons to take their places in the many systems of society.

Both gentlemen agree that there are instructional activities which
are sufficiently different from each other to warrant two different
labels, despite the fact that--from a practical point of view--both the
individual psychological processes involved and the technological
practices to carry them out are the same.

The key distinction for me is that the training program is intended
to prepare the trainee to fit into a particular system. This makes
it possible to specify the desired end-products of learning. And if you
can specify these end-products, then you can design instruction to train
(or "build in") the desired trainee performances, and you can design
evaluation procedures to assess how well trainees can perform, and how
well the training program is accomplishing its purposes. Unhappily,
those of us in education do not "have it so good." We'll return to this
distinction later.

Problems of terminology and definitions are not merely quibbling
over "semantics." For example, let's take the term "performance tests."
That's why we're here today--to talk about performance tests. As though
there were any other kind. All tests are designed to elicit and/or
measure performance. The original distinction was between tests that
required the use of language and those that did not. The original
performance test was the form-board, an intelligence test for the deaf,
the language-handicapped, the foreign-born.

Even the dictionaries of psychology recognize the ambiguity of the
term "performance test." One such dictionary identifies three uses of
the term: (a) a test involving special apparatus, as opposed to a
paper-and-pencil test; (b) a test minimizing verbal skills; (c) a
work-sample test. The dictionary goes on to say that all of these uses
are unfortunate because the term "performance" already means "the
behavior of an examinee on a given test," and "the score of any
specified examinee on a test," etc.

And yet, we here today know what modern educators mean when they
use the term "performance test." Or do we? I have seen recent articles
by prominent educators which dichotomized the field of achievement
testing into performance tests versus "paper-and-pencil" tests, or
"knowledge" tests.

As my colleague, Bill Osborn--who is with us today--points out,
this kind of labeling reflects artificial distinctions and is misleading.
Bill reminds us that a true performance test for many clerical tasks
would also be "paper-and-pencil." And if you wanted to assess the
performance of someone who operates an information center, you would
have to engage in "knowledge" testing. Even a multiple-choice test can
also be a performance test; take the case of a surgical assistant who
has to select the proper scalpel or other instrument at the command of
the surgeon.

Incidentally, Mr. Osborn, who is the Director of the HumRRO Research
Office in Louisville, Kentucky, is the HumRRO scientist who--at
present--is most deeply involved in the whole area of performance
testing. He will be with you throughout the day, will be one of this
evening's discussants, and is much more qualified than I to answer your
technical questions.

Performance testing, in the sense that I suspect most of us here
think about it, has a long and honorable history. It can be traced
back to ancient Greece (as so many aspects of American culture can).
The medieval Guilds in Europe tested apprentices. In this country,
industry has applied some form of performance testing since the
Industrial Revolution. It picked up steam with the advent of the
"scientific management" movement fathered by F. W. Taylor at the turn of
the century. It picked up additional steam, in the military arena,
during World War II, when the largest number of psychologists ever
assembled on one project conducted the Army Air Forces Aviation
Psychology Program.

Throughout most of these years, the use of performance testing was
pretty well restricted to occupational performances--in industry, in the
military, and later in vocational education (which began as
industrial-arts training). Several movements have conjoined within the
past few years to bring performance testing to the forefront in general
education. One of these has been the accountability movement. A second
has been the behavioral-objectives movement.

Now, the notion of accountability has been around for a long time.
In the traditional pattern, the school administrator has been responsible
for justifying school-system performances to his political
superiors--the school board. The expectations of most Boards along these
lines have been modest, and the administrators have not typically
provided more justification than was required.

The newer pattern involves a more specific set of expectations,
more narrowly defined, with the powers-that-be calling for meaningful
indices of school-system performance. It has been expected that
administrators would provide effective educational programs and would
make efficient use of the resources available to them for that purpose.
But now they have to prove it. And prove it not only to the school
board, but to other newly-involved groups, including taxpayers, who want
to know what they're getting for the additional dollars being invested
in education.

It would be an understatement to report that the accountability 
movement has not received the wholehearted support of the educational 
establishment. And it should not be forgotten that the movement was not 
generated within the educational establishment, but was imposed on it 
from the outside. 

Accountability is a goal-directed management process. So it is
easy to see how it ties into the behavioral-objectives movement. This
latter movement has had, as one of its principal purposes, making the
goals of education more operational.

I use the term "operational" in the sense of "operational
definition." That is, the definition specifies the operations which
define the concept. In this case, the behavioral objective specifies the
behavior which constitutes the objective of instruction. I know that,
for this audience, I don't have to go into any detail about
behaviorally-stated instructional objectives.

So, here we have the confluence of one movement which says that
school administrators have to specify their purposes and accomplishments
in a way that is susceptible to assessment, and another movement that
says "here is the way you can specify instructional objectives to make
them measurable." Don't they fit together nicely?

What has this got to do with performance testing? Well, one of the
precepts of the behavioral objectives movement is that, to measure
student progress, you must measure what the student can do following
instruction that he could not do before. The action word there is do:
what he can do following instruction that he could not do before. The
objective tells us what behavior to look for, and under what conditions,
and to what degree of proficiency.

Thus, the behavioral objective not only serves the instructor by
making it perfectly clear what the student is supposed to accomplish; it
also serves the evaluator by providing a "model" test item or set of
items. Of course, in many cases, the instructor and the evaluator are
one and the same person.

We'll come back to education in just a few minutes. Right now, I
want to talk a little about HumRRO experience with performance
testing--primarily in training, and primarily in the military training
setting.

As an applied research-and-development organization, we were
expected to conduct R&D that would "make a difference" in the Army's
training operations. We are still in business after 23 years; we are
still one of the Army's principal sources of R&D in training and
education; and we are entering into contracts with an ever-increasing
number of new sponsors. Those are three pretty good indices that we
have, in fact, made a difference with our work.

In our early days, we spent a good bit of time working on individual
Army curricula, or training programs. Essentially what we did was
to apply the best of what was known about training technology to those
programs of instruction that were having trouble. We would come up with
a prototype, revised course; we'd compare the graduates of this
experimental course with graduates from conventional courses; and if the
experimental-course graduates performed better, or if they performed
equally well following training which took less time or cost less money,
we would recommend Army adoption and implementation.

You'll notice that I said we would compare experimental-course
graduates with conventional-course graduates. This comparison is almost
always made on the basis of a performance test (in the sense in which we
here today are interested in that label). That is, we require the
graduates of both courses to do their thing.

That "thing" is usually some facsimile of the real-world job for
which the soldiers are being trained. It is a performance test in the
best sense of that term. In such tests, we attempt to simulate the
information inputs to the trainee that would come to him if he were
actually on the job, and to measure job output--that is, his proficiency
at doing the job.

Let me try to put the test into perspective. In our course
development work, we have taken what has come to be called "the systems
approach." Although there are a number of variations, the HumRRO
approach is shown in this paradigm.

The first step is an analysis of the operating subsystem in which
the job of concern is located. This analysis provides information on
the characteristics of both the hardware and human components of the
system. It also gives some indication as to whether R&D efforts are
best invested in selection and classification of personnel, in
human-factors engineering (that is, designing or redesigning the
hardware to better fit the man), or in training. Let's assume that the
answer came out "training."

In the second step, there is an analysis of the particular job
about which we're concerned. We attempt to determine the inputs to the
job from the rest of the system, and the outputs that are required.

It is important to note that development of the proficiency test--
the performance test--is derived directly from the analysis of the job.
It is in no way dependent upon what is taught in the eventual program of
instruction. Ideally, we even put a separate group of researchers on
this task--scientists who are not involved in the curriculum-development
effort.

The final step in the paradigm is the evaluation of the new
curriculum. Obviously it is not the final step in the
curriculum-development activity, and there could be lines and arrows to
show feedback, boxes to show revision, dissemination, and
implementation. But this is the core.

There are technical problems with this kind of measurement,
especially with regard to choosing the proper research design. Egon Guba
has written forcefully in several AERA publications to the effect that
designs that are appropriate for basic research are not appropriate for
curriculum evaluation. Dr. John L. Finan addresses these problems in his
chapter of the Gagne book, Psychological Principles in System
Development. And the AERA has published seven paperback monographs on
the evaluation of curriculum-development projects in which several
authors also address these problems. I will not attempt to go into that
kind of detail today.

A couple of years ago, I tallied up 88 training programs on which
HumRRO had worked. Most of these involved the development of performance
tests--but not all. In some instances, we produced only parts of
training programs--but even in such cases, we usually went through some
of the steps that we consider fundamental to the development of
performance tests--that is, job analysis and task analysis activities.

We were extremely pleased, as researchers, when in 1966, the Army's
training command issued a regulation on the "systems engineering of
training" that, essentially, adopted the HumRRO approach, and made it
official Army doctrine. The Air Force, with which we had been sharing 
copies of our report, subsequently adopted a similar approach to "instruc- 
tional systems development." While we can't very well take credit for 
the Air Force decision, we did note with some pride that more than 50 
percent of the references cited in the AF Regulation were HumRRO reports. 

Because the development and use of performance tests are such
typical HumRRO activities, a large number of our professionals have been
involved with them. However, the performance test as a subject of study
in its own right has been a matter of continuing concern to Bill Osborn,
Director of our Louisville Research Office. Several years ago, he
charted the major action points in the course of developing a test for
training evaluation.

let me take k moment to recap. - I've explained that HumRRO begSn 
by developing and using performance tests in connection with specific ^ - 
training programs as part of a general overall systems approach to cur- 
riculum development. I have shown you a diagram,^ and will provide you 
with a copy of a general ized statement of the HumRRO view of what's in- 
volved. in developing performance tests. I'd like to move a little closer 
to present-day by telling you something about a relatively new "model" • 
for performance-based training and testing that we developed for the 
Army, and that is now being implemented across-the-board. 
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Every week, instructors in the military services are confronted 
with incoming classes that must be taught a considerable amount in short 
and relatively fixed periods of time. These classes are usually quite 
heterogeneous with respect to students' educational background and learn-
ing aptitude. From earlier research, by a large number of individuals
and organizations, it was apparent that the traditional military lock- 
step, lecture-demonstrate-practice-test approach to instruction would not 
be particularly effective for trainees at either end of the ability spec- 
trum--the low-aptitude and high-aptitude personnel. 

HumRRO was asked to come up with a new approach to Army training.
It had to be both effective and efficient. And there were other con-
straints. It couldn't cost any more than current instruction. It
couldn't require instructors of higher caliber or greater sophistication
in training. It could not require any significant increase in the amount
of operational equipment for practice, nor could it require any extension
of the training period, or expensive instructional hardware or software.
In sum, the new approach had to be fashioned out of the currently avail-
able resources.

Under such constraints, the new approach, or "model" as we like to
call it, evolved as one in which the instruction of trainees by other
trainees is a central feature--that is, peer instruction.

There are six principal features to the model, in addition to peer
instruction: 

(1) Modular Sequencing. The course is organized around a series
of job-performance stations that represent the various duties performed
by a person competent in the job. The number of stations is determined
by the number of coherent sub-jobs in a specialty. Since each station
represents discrete sets of tasks, a trainee can enter the system at any

(2) Self-Pacing. The period of time a trainee spends at any sta-
tion depends on how long it takes him to learn to perform the tasks. 

(3) Insistence on Mastery. Each trainee undergoes a proficiency
test (that is to say, a performance test) when he is satisfied that he
has learned a task. He must demonstrate that he has mastered the neces-
sary skills before he is allowed to proceed to the next task in the se-
quence. If he fails any test, he must review and practice until he can
pass. Incidentally, for quality-control purposes, the tests are adminis-
tered not by the peer instructors, but by full-time cadre members, who
are on hand as training supervisors.

(4) Rapid and Detailed Feedback to Trainees. Since proficiency
tests follow each task, the trainee knows immediately whether he has 
learned the required skills. 

(5) Rapid and Detailed Feedback to Instructors. Since the train-
er/supervisor administers the proficiency test, he knows immediately
whether the instruction has been successful.

(6) Functional Context Training. Job-performance stations repre-
sent actual on-the-job duties that must be performed, so the trainee
actually learns the required skills and knowledges in a job-like setting.

This new model was field-tested at Fort Ord, California, with sol- 
diers training to be Field Wiremen. It produced graduates who were 
markedly more competent at their job than conventionally trained wiremen. 
At the same time, it also reduced training time, training costs, academic 
recycling, and academic failures.

The Army immediately adopted the prototype program for all its 
field-wiremen training, throughout the United States. It also directed 
that this new, performance-oriented approach to training and testing be 
adopted throughout the Army. In recent months, we have been helping Army
training managers implement this new model in courses for cooks, mechan-
ics, heavy-equipment operators, air defense technicians, and infantry.

I am sure you can recognize the key role that performance testing
plays in this approach to training.

In 1971-72, we moved the model outside the Army and tested it in
the public schools in a course on the office cluster of business occupa-
tions. This test was conducted in the Pacific Grove Unified School Dis-
trict. Test results indicated that this performance-based instructional
model produced graduates with statistically significantly superior job
knowledge, who were dramatically superior in job performance to their
conventionally trained peers.


We have introduced this model into a junior college in Vermont 
where the emphasis is on Adult Basic Education, and on occupational/vo-
cational education, primarily for rural white adults. We have also in-
troduced this model into a Community Action Project in Alabama, where
the concern is with training women for occupations as household workers.

I recognize that your interest today is in performance tests, per
se, rather than their role in programs of this kind, no matter how inno-
vative or effective. But my point here is, simply, that it is perfor-
mance testing that is driving the system.

To this point, I have covered the past and the present. What
about the future?

I come back again to Bill Osborn, and one of his current projects. 
In fact, I've quoted him and borrowed from him so often that it is clear 
to me now that it is he, rather than I, who should be addressing you.
However, having typed this much manuscript with two fingers and a thumb, 
I refuse to relinquish the podium. I will press on. 
Bill points out that the logic of developing a performance test is
simple. You conduct a job/task analysis, recreate the job task in a test
setting, ask the trainee to perform the task, then record whether he did
it or not. Unhappily there are many, many reasons why performance test-
ing is not that simple.

Let me cite but one example--and a "simple" one, at that--teaching
someone to drive an automobile. Dr. A. James McKnight, on a 1971 HumRRO
project, conducted a comprehensive analysis of the driver's task in order
to identify critical driving behaviors from which instructional objec-
tives and test items could be derived. He and his colleagues found that
in simply driving on an open highway there are more than 1,700 specific
driving behaviors. You can imagine the number of instructional objec-
tives and the number of test items that would have been required if some
process of distillation, some determination of criticality, had not been
undertaken. And there is little in the performance-testing literature
or job-analysis literature to guide the scientist in this distillation
process.

The two major evaluation tools the instructor has available are 
job-knowledge tests and job-performance tests. There is a question as 
to how results from the two types of tests correlate. Some researchers, 
in some settings, have found correlations so low as to indicate that job- 
knowledge tests are practically worthless for assessing individual pro- 
ficiency. Other researchers, in other settings, have found the correla- 
tion reasonably high. 

Practically everyone agrees that a performance test is, in some
way, better than a knowledge test. I think we would all feel happier,
as instructors, if we could have our students do the job rather than tell
us about doing the job. But the typical performance test is more
expensive than the knowledge test. It sometimes requires too much
equipment, too many test administrators. And sometimes, the level of
professional skill needed to develop and supervise administration of per-
formance tests is simply not readily available.

As Osborn points out, the training manager is faced with a choice
between a practical evaluation tool with questionable validity on the one
hand (the knowledge test), and an impractical tool with high validity on
the other hand (the performance test). He feels that this Hobson's
choice presents a false dilemma--that there are other solutions that lie
in between these two extremes. He has proposed the concept of the "syn-
thetic" test and is busily engaged these days in testing the concept.

In fact, Bill suggests that there are a number of alternatives
which fit between the two extremes, each one combining a differing mix-
ture of validity and feasibility. Let me give an example.

Let's assume we have the following performance objective. "Given 
binoculars, paper and pencil, and 20 targets in various degrees of con- 
cealment and orientation, at ranges of 500 to ?500 meters, the soldier 
will estimate and report the range to each target, accurate within two 
meters on 16 targets within 10 minutes." 
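
As a rough illustration of how an objective of this kind might be scored, here is a minimal Python sketch. It is not part of the original address: the function name and parameter names are hypothetical, and the defaults (a two-meter tolerance, 16 acceptable estimates, a 10-minute limit) are simply the figures stated in the objective.

```python
def meets_objective(estimates_m, true_ranges_m, minutes_used,
                    tolerance_m=2.0, required_hits=16, time_limit_min=10.0):
    """Return True if enough range estimates are within tolerance, in time.

    All names and defaults are illustrative assumptions taken from the
    wording of the objective, not from any HumRRO scoring procedure.
    """
    if len(estimates_m) != len(true_ranges_m):
        raise ValueError("one estimate is required per target")
    # Count the targets whose estimated range is close enough to the truth.
    hits = sum(
        1
        for estimate, actual in zip(estimates_m, true_ranges_m)
        if abs(estimate - actual) <= tolerance_m
    )
    return hits >= required_hits and minutes_used <= time_limit_min
```

A soldier who misjudges four of the 20 targets but finishes within the time limit would pass under this rule; a fifth miss, or an eleventh minute, would fail him.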

This objective consists of eight behaviors. If we took the soldier
onto the range and conducted a full field test, we would be able to
assess his performance on all eight. However, given a large number of
soldiers to be tested, and only limited resources, the typical reaction
is to conduct a paper-and-pencil test of the soldier's understanding of
the mil relation formula. This test addresses only two of the eight com-
ponent behaviors, but it is easy to administer.

I've already talked about method 4 (the field test) and method
6 (the paper-and-pencil test). The in-between methods represent
alternatives of intermediate complexity. They were fabricated by con-
sidering economically available ways of eliciting each skill and knowl-
edge, and then synthesizing them into a test method.

"In weighing these alternatives, note that as the simplicity of the
test method increases, information on some component behaviors is lost.
The simpler we try to get, the more information we lose. We eventually
reach a point where it doesn't even make sense to give a test. Also,
the simpler the test method, the more diagnostic information we lose--
that is, information that could help us identify where our training pro-
gram needs improvement."
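
The trade-off can be made concrete with a small tally of which component behaviors each test method lets us observe. The Python sketch below is only an illustration under stated assumptions: the eight behaviors are given placeholder labels because they are not named in this text, the "in-between" method and its coverage are invented, and which two behaviors the paper-and-pencil test reaches is assumed.

```python
# Placeholder labels for the eight component behaviors (not named in the text).
ALL_BEHAVIORS = frozenset(f"behavior_{i}" for i in range(1, 9))

# Coverage per method. Only the counts for the field test (8) and the
# paper-and-pencil test (2) come from the talk; the rest is hypothetical.
METHODS = {
    "full field test": ALL_BEHAVIORS,
    "hypothetical in-between method": frozenset(
        f"behavior_{i}" for i in range(1, 6)
    ),
    "paper-and-pencil test": frozenset({"behavior_1", "behavior_2"}),
}


def information_lost(methods, all_behaviors):
    """Map each method to (behaviors assessed, list of behaviors not assessed)."""
    report = {}
    for name, covered in methods.items():
        missing = sorted(all_behaviors - covered)
        report[name] = (len(covered & all_behaviors), missing)
    return report
```

Reading the report from the field test down to the paper-and-pencil test shows the list of unobserved behaviors growing as the method gets simpler, which is the loss of diagnostic information the passage describes.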

The concept seems reasonable. Mr. Osborn is doing more in the way
of conceptual development, is seeking empirical verification of his no-
tions, and will eventually codify procedures under which test developers
can use "synthetic" performance tests.

I have taken longer than I intended to reach this point, but be-
fore I conclude, I want to return briefly to the distinction between
training and education that I made earlier. You remember that both Dr.
Glaser and Dr. Crawford made the point that it was easier to identify
training requirements than educational ones. The job-analysis/task-
analysis approach doesn't have much application in the general-education,
liberal-arts fields--at least not yet, so far as I can see.

If you were to view instruction as taking place somewhere along a
continuum that runs from the specificity of training to the generality
of education, you would find the concept of performance testing increas-
ingly difficult to apply as you move from training toward education. It
shouldn't be necessary for me to remind any of you that a test--even a
performance test--is only one tool for evaluation. This is even more
true when you are appraising individuals instead of instructional programs.
Performance testing is only a leaf on the twig of tests and measurement,
on the branch of evaluation, on the tree of instruction. And, if I may
be forgiven another simile, I hope that none of us will ever be accused
of being like the small boy who is given a hammer as a present; it's
amazing how many things he can find around the house that need a good
pounding.

The examples I've cited for you today have all come from the train-
ing end of that training-education continuum I mentioned. To move toward
the education end will take time, effort, imagination, and ingenuity. 
But those of us in the performance testing business won't have to go it 
alone. We're in good company because the instructional technologists 
and the systems analysts are all wrestling with the same problem. If, 
and when, we and they develop tools and techniques for reducing our 
global educational goals to discrete, behavioral objectives, the rest of
the job will be much, much easier. 

In the original charge given me by the Clearinghouse, I was asked 
to conclude my presentation by identifying major gaps in our understand-
ing of performance testing, and to suggest directions that future R&D
might take. I have made several stabs in that direction, but must con-
fess that I am unable to carry out that assignment. I can only identify
gaps that strike me as important. I suspect that, given our multiple
purposes for wanting to use performance tests, we might each develop a
different list. However, for what it's worth, here's my list.

First, I would like to see someone undertake a state-of-the-art
survey of performance testing, and come up with a handbook or how-to-do-
it manual that evaluators and teachers could use today. Not all evalua-
tors are as sophisticated as they should be, and not all teachers have
the time to delve deeply into the subject.



Second, we need that missing third taxonomy. We have Bloom, et al.,
on the cognitive domain. And we have Krathwohl, et al., on the affec-
tive domain. But we don't yet have a suitable taxonomy for the psycho-
motor domain. Dr. Fleishman and his colleagues at the American Insti-
tutes for Research have been working in this area for some years, and their
reports are both interesting and useful. They may be on the verge of the
kind of taxonomy I'm talking about but, in any event, we need it, and
soon.

Third, it seems to me that the major problem faced by those who
want to use performance tests in education is the problem of criterion.
By your interest in performance testing, you have indicated an interest
in moving education from norm-referenced tests to criterion-referenced
tests. But to do this, you must have a criterion. And if your efforts
are to be fruitful, you must have an appropriately relevant criterion.

Our colleagues in the human-factors engineering field are inter- 
ested primarily in human performance in man-machine systems. This is 
only one of the kinds of human performance in which we, as educators, 
are interested. And yet, in their relatively small area, they have con- 
siderable difficulty finding appropriate criteria on which to validate 
their proficiency and predictive tests. How much more difficult our job--
we who take all of education as our territory.

Fourth, I come to a closely-related problem area (one I've already
mentioned): the specification of observable, measurable instructional
objectives. We must find a way to operationalize our beautiful-but-ab-
stract educational goals. There is an element of truth in the accusations
that we have, thus far, been able to develop measurable objectives only
for the "trivial" outcomes of education. There are, in fact, important
outcomes that we have not yet been able to express in behavioral terms.
And the more abstract the outcome, the more difficult the task--for
instructional design, instruction, and evaluation.


Fifth, the two foregoing problem areas can be incorporated along
with the problem of determining what ought to be the goals of American
education. This is not a problem for which educational evaluators have
any unique responsibility. On the contrary, everyone in education (and
outside it, too) has some degree of responsibility for determining the
most appropriate goals for American education. But we have some unique
tools and several potentially useful methodologies to offer. And we are
reasonably committed to the "scientific approach" which, I feel, is badly
needed to leaven the mixture of arm-chair philosophy, common sense, and
vested interests with which this topic is commonly addressed.

One recurring suggestion has been that we concentrate on competency
in adult life. David McClelland spoke to this point in his 1973 Ameri-
can Psychologist article. This is not a novel suggestion. Between 1915
and 1919, the NEA Committee on the Economy of Time sought to identify
what adults do, and what they need to know, and to use this information
in establishing goals for American education. In my own mind, I date the
beginning of scientific curriculum-making from the work of that Committee.
I commend its four-volume report and the McClelland article to your atten-
tion.

In conclusion, let me quote from Dr. Earl Alluisi, formerly of the
University of Louisville, and now of the University of Virginia. In a
1967 article in the journal Human Factors, he said:

"'Performance assessment' is one of the most important and diffi-
cult areas of current research. It is important in its own right, as any
supervisor who has been called upon to justify the ratings of his workers
can attest. It is important also because it is the crux of the 'criterion
problem' for so much other work; the final validation of selection and
training techniques depends upon the assessment of the performance of men
who have been differently selected and trained. The final validation of
an improved, human-engineered, man-machine system depends upon it . . .
The assessment of man's behavior in the meaningful performance of com-
plex tasks has challenged physiologists, engineers, and psychologists for
many years. The task has been recognized as a difficult one; the prob-
lems have been formidable; and the solutions have been ephemeral . . .
Considerable quantities of good and respectable research have been pub-
lished . . . (which) advanced science generally, but it failed to
provide any significant progress towards performance assessment ..."

If we come away from this conference having advanced the ball only 
a matter of inches, it will have been worthwhile. 

SMALL GROUP DISCUSSION REPORTS ON PROBLEMS,
ISSUES, AND NEEDED RESEARCH AND DEVELOPMENT
IN APPLIED PERFORMANCE TESTING



Administrator Group: Gerald H. Lunney, Reporter

Like so many things, it's so easy to say yes to being a small group 
reporter but it is so hard to do. We had, I think, a very stimulating 
session. I'm just going to run through some random thoughts. 

Interestingly enough, for a group of administrators, we spent a 
good deal of time on two issues. The first one was cost. References 
were made to cost twenty times in our group. There were some interesting 
concerns relative to costs and to the whole question of APT. Cost is
involved because when you measure behavior, you need people to conduct 
the measurement, and often they are not trained to adequately observe what 
is going on. One general issue that was raised was the problem of re- 
liability of graders. Along with that and other concerns that affect 
cost is the question of what criteria relate to good performance. How
are you going to clarify good performance so that everybody knows what
it is and when it has taken place?

The other major issue for administrators was politics. Part of
this concern came from the fact that a great deal of the interest ex-
pressed in APT has come, as we have mentioned before, from external
agents. We got into another topic which I have noted here as "standard-
ization plus and standardization minus." Standardization plus was con-
cerned with the fact that if you had different people defining appro-
priate performance, how do you arrive at an acceptable definition? How
are we going to be able to say that if a student passes the test in City
A and then moves to City B, that his performance is acceptable?
Standardization minus, regarding the actual behaviors of people, raises
personal and culturally bound questions about performance. Since we
have a diversity in cultures and we are talking about whether we can
really establish some overall standards, how can we say that once a stu-
dent has passed the tests he can perform adequately in different kinds
of settings, interacting with different kinds of people?

Another topic we discussed was who should set the standards of be- 
havior. How should they be established? For example, who is going to 
establish the standards for mathematics? Is it going to be the parents 
with a wide variety of needs or desires for their children? Is it going
to be the school? Is it going to be those people who will ultimately 
employ the students? We can't differ from the curriculum in applied 
performance testing. As we develop the curriculum in conjunction with 
applied performance testing we are talking about a single package and 
not different things. Someone raised a point about freezing the appro-
priate behavior in time--the question of whether current adult behavior
is a sufficient standard for judging performance.

The last point relates to the appropriate starting-point: Are we
doing the same thing with applied performance testing that we did with
criterion-referenced testing? Are we trying to legitimize it by making
sure it fits everything that we learned when we took the first three
courses in tests and measurements? Or, is it sufficiently different
that we should perhaps hold back a little and make sure we have appro-
priate testing gear? Can APT stand by itself and can't we legitimize it
on that basis--and not on the basis of how well we can fit it to what we
learned way back when?

Curriculum Group: Craig Gjerde, Reporter

Since we were called a curriculum group we tended to get away from
the testing and get into the planning of tests. We had an interesting,
but hurried discussion of many points which have already been discussed.
One big question was, How do you define skills needed by the student,
especially in regard to adult life or the many styles of life those peo-
ple might live as adults? We suggest that perhaps those life skills
should not be defined by people in the traditional teaching disciplines.
They should somehow be linked to survival education; we didn't define
explicitly what we mean by "applied" performance testing, but I think we
felt that the term referred more to the long-term survival value of the
education than to the current academic emphasis.

Another question that the administrators should perhaps have con-
sidered was, How do you handle differences in completion times that could
occur if you get into applied performance testing? Do you allow students
to leave high school when they are thirteen years old? There was also
some discussion about how long it might take to install applied perfor-
mance testing in our educational system. Some people thought that it
might take us a long time to rethink our educational values and develop
appropriate applied performance testing strategies.

One basic issue that comes up over and over is, Who are the experts
that are going to develop this applied curriculum? How are we going to
somehow agree on what these survival value skills are? We have to rec-
ognize that there will be many political and social pressures that will
resist the change.

We identified some areas in which research and development efforts 
are needed: defining the kind of staff and facilities needed to imple-
ment applied performance testing, and developing instructional modules
based on the several different approaches to organizing materials.




Measurement and Evaluation Group I: Sarah S. Knight, Reporter

Being more or less measurement oriented we started with the eval-
uation guidelines for applied performance testing. The first thing we
concluded was that the outline form might not be very helpful. It seemed
overly complex; the wording could be simplified. We talked about the
audience for these guidelines, and determined that they were written for
the technician in tests and measurements. It was suggested that we might
want to modify the definition of applied performance testing, so we were
not restricted to either simulated or real situations but could allow for
inferential testing.

We discussed test content with respect to minority groups, and
there was some concern regarding possible over-emphasis on minority
group content, which might actually restrict the kinds of things we could
test for and result in a test that could only be locally applied. It
was suggested that we take a look at the EEOC guidelines (1970) for em- 
ployee selection procedures. 

Then we talked about problems. One problem concerned developing 
performance tests. Probably one of the more functional ways to develop 
such tests is to concentrate on the parts of a task that a person does 
incorrectly (i.e., to concentrate on the errors rather than the total 
content of the task). Also, the question was raised whether we should
concentrate on process or product in terms of applied performance. We 
concluded, as the curriculum people did, that reliability was a crucial 
problem. 

In talking about applications in public schools, we got into a 
long discussion about what constituted basic skills and how we were 
going to test them. 

Measurement and Evaluation Group II: Richard L. Stiles, Reporter

Essentially, we discussed two questions: What do we test? and Do 
the tests have content validity? Under these headings we dealt with com- 
petency-based instruction and prerequisite performance skills. 

Also, we discussed who should set priorities in identifying impor-
tant performance, and for what purpose. Should target performance be set
at the school level, district level, county level, state level, or nation-
al level? Who is buying the can of dog food--the dog or the owner?

What should we test? I think testing what the learner will be pre-
disposed to perform is important. That is, test what he will do in the
future versus what he can do now. I think in terms of accountability,
we can say that the schools should be responsible for current performance.
We can't guarantee what learners will do when they go out into society.

There is a problem in defining literacy. For example, do you need 
inferential skills to be considered functionally literate? Again, in 
applying applied performance testing in public schools at lower educa-
tional levels, if you have trouble deciding at the high level what con-
stitutes basic skills, you will not be able to identify prerequisites.
We are going to have to do a little backtracking.

With respect to issues in public schools, to what extent are the 
public schools responsible for those identified skills? That is, many 
performances might not actually be a part of what public schools now be- 
lieve they should be doing. 

In terms of research efforts, it was suggested that we need more 
emphasis on research regarding naturalistic observation. We need to con-
sider doing longitudinal studies on performance testing. We need to
identify situations in which applied performance testing is appropriate.



DISCUSSIONS OF PROBLEMS, 
ISSUES, AND NEEDED RESEARCH 
AND DEVELOPMENT IN APPLIED 
PERFORMANCE TESTING 



Joseph L. Boyd, Jr. 
Educational Testing Service

One of my major concerns regarding the development and use of per- 
formance tests, about which I had intended to speak tonight, has been 
addressed in the paper "Criterion Guidelines for Evaluation of Applied
Performance Test Materials and Procedures." In presenting the paper for
discussion, CAPT has taken a very important step toward increasing the
development and use of performance tests in schools. Congratulations to
CAPT--even though they stole my thunder!

In emphasizing the development of testing instruments involving
real-life simulations we must not lose sight of the fact that some paper
and pencil tests can require complex performance and are, in that sense,
a kind of "performance" test. Examples include the "patient management"
problems of the National Board of Medical Examiners tests, and several
nursing specialty certification programs. These tests are modern vari-
ants of the "tab" test, whereby the examinee makes a response choice and
obtains additional information. These tests are variously referred to
as programmed tests, or variable sequence tests. The ultimate failure
on such a test occurs when a physician erases his final choice in a ser-
ies of choices and gets the additional information: "patient expired!"

¹Hubbard, John P., "Programmed Testing in the Examinations of the Nation-
al Board of Medical Examiners." Proceedings of the 1963 Invitational
Conference on Testing Problems.

This type of test has also been developed to assess other kinds of
diagnostic skills. British radio repairmen take a programmed test as
part of the procedure for occupational licensing. The number of examin-
ees had made the trouble-shooting test with real radios an unmanageable
task. Motor vehicles, guided missile electronic and hydraulic systems,
radar, and television fault diagnosis tests are also being used. On a
small scale, programmed testing can be done with a response on one side
of a card, and the added information on the other.
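
The card arrangement just described can be sketched in a few lines of Python. This is only an illustration, not a reconstruction of any actual examination: the radio fault, the three choices, the scoring rule, and the names `Card`, `DECK`, `administer`, and `score` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Card:
    """One card: the response on the front, the added information on the back."""
    response: str
    added_information: str


# Hypothetical trouble-shooting item; the content is invented for illustration.
DECK = {
    "A": Card("Check the power supply voltage", "Voltage is normal."),
    "B": Card("Replace the output tube", "Set still dead; tube was good."),
    "C": Card("Test the speaker coil", "Coil is open: fault located."),
}


def administer(deck, choices):
    """Turn over each chosen card in order and return what the examinee saw."""
    seen = []
    for key in choices:
        card = deck[key]
        seen.append((card.response, card.added_information))
    return seen


def score(choices, correct="C"):
    """Assumed scoring rule: number of cards turned before the correct one.

    Returns None if the examinee never reaches the correct response.
    """
    return choices.index(correct) if correct in choices else None
```

Because each response reveals information that shapes the next choice, the sequence of cards turned, and not just the final answer, is available for grading.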

I would like to make another observation--regarding standardization 
of tests. I speak not of the statistical treatment of test results, but 
of specifying to the examinee exactly what he or she is to do, and ob- 
serving and grading the performance, or product, or both in a systematic, 
predetermined manner. To me, this is performance testing. I recently 
reviewed a paper in which the author failed to differentiate between 
standardized testing such as I have defined it, and observation of un- 
specified, undirected student behavior. The latter activity could never 
be construed as "performance testing," as I see it. 

I appreciate the opportunity I've had today to hear and be heard. 
CAPT and NCME have done a great service to education in hosting this 
meeting. I am taking away from this meeting much more than I brought.
Thank you. 

Hulda Grobman
University of Illinois College of Medicine 


Aspects of performance tests that may be noteworthy, in addition
to those mentioned by Mr. Lavisky:

1. It is far more difficult to standardize test administration con- 
ditions for performance tests than for conventional tests insofar as the 
critical (significant) variables are concerned. What is supposed to 
happen in the way of environment and process, what is reported to have 
happened, and what actually did happen may not be entirely congruent. 
Non-events--things that are supposed to happen but did not--may be fre-
quent. Feedback from examinees concerning test administration conditions
may be one way of checking on gross omissions or commissions.

2. What is a satisfactory correlation between test performance and
actual on-the-job performance? A correlation appropriate for one pur-
pose may not be appropriate for another. Thus, a correlation of .65 or
.70 between performance test and on-the-job performance for a job that
is relatively closely supervised or non-critical (in terms of cost of
error--material and human) may be appropriate. But a similar correlation
may be inappropriate in an area where error is critical (surgery, or
navigation of a plane). Also, it should be kept in mind that correlation
does not imply a cause-effect relationship. A correlation between a per-
formance test and on-the-job performance may reflect a third variable
which may not always be present, and so the correlation may change un-
expectedly.

*Presently Professor of Health Education, St. Louis University Medical
Center



3. The selection of tasks for performance tests (the portion of the
universe to be selected), and the repetition of tasks present problems.
Reliability in the test-retest context presents problems since human
performance is not necessarily reliable. People have good days and bad
days. How many times should a task or kind of task be repeated to val-
idly assess its mastery? And tasks within a job may be unrelated in
terms of mastery, so that internal consistency measures may be inappro-
priate.

4. Though content validity might appear to be self-evident for per- 
formance tests, such validity may not exist. The tasks to be performed 
may not, in fact, be a necessary component or standard. 

5. Scoring mechanisms for performance tests require more systematic
concern than may be self-evident. Scoring is probably a more complex
concern than is the case with conventional tests. In addition to the
question of obtaining reliable scoring, there is the question of whether the
examinee should be permitted to continue a test after committing a ser-
ious error at some point before completion. Allowing him to continue may
waste resources and endanger the examinee or a subject he is interacting
with. What are go/no-go points? What are valid criteria for establish-
ing these? How should elements of the exam be weighted? Are all equal?
Are some absolute requisites and others desirable non-requisites? If
passing is based on total score, we may pass a student who ruins his ma-
chine or kills his patient.

6. As in more conventional testing, the performance test may require 
modification to reflect whether it is for diagnostic and/or certifying 
purposes. For certifying purposes, a product may be all that is needed 
to judge adequacy of performance; for diagnostic purposes, the product
alone may provide insufficient data.

7. Because the format of performance tests will be a new experience
for many examinees, there should be prior explanation of the format and,
if at all possible, a practice dry run using the performance format.
Without such an advance tryout, the test may be one of the examinee's
adaptability or testwiseness rather than of his ability to carry out a
specified job.

8. Even teachers who have been observing performance for many years
may not be accurate observers. Observing is a learned skill; it requires
practice and uniform criteria. Such uniformity of interpretation, wheth-
er of product or process, cannot be assumed. Performance tests lend
themselves to the halo effect at least as readily as essay tests, the
grading of which is notoriously unreliable. Like essay tests, perfor-
mance tests seem simple to construct, and this may be the case. However,
the scale for judging performance is not. And neither is it simple to 
achieve congruence among raters or for one rater over time. Without 
such congruence, the test is invalid. 

9. Performance testing is admittedly expensive, far more so than con- 
ventional paper-and-pencil testing, in terms of the facilities and the 
time required of examinee and examiner. However, failure to use perfor-
mance tests may be still more expensive--though the cost may not be as
obvious.
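
The go/no-go question raised in point 5 above can be sketched in a few
lines of Python. This is an illustration only: the element names, point
weights, and 70-point cutoff are hypothetical and are not drawn from any
actual performance test. The sketch shows how a total-score rule can pass
an examinee who fails an absolute requisite, while a go/no-go rule cannot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One scored element of a performance test (hypothetical)."""
    name: str
    points: int       # weight toward the total score
    critical: bool    # True if the element is an absolute requisite


def total_score(elements, passed):
    """Sum the points of every element the examinee performed correctly."""
    return sum(e.points for e in elements if passed[e.name])


def passes_total_only(elements, passed, cutoff):
    """Pass/fail decided by total score alone."""
    return total_score(elements, passed) >= cutoff


def passes_with_go_no_go(elements, passed, cutoff):
    """Fail outright on any missed critical element, then apply the cutoff."""
    if any(e.critical and not passed[e.name] for e in elements):
        return False
    return total_score(elements, passed) >= cutoff


# Hypothetical checklist for a shop task; weights sum to 100.
ELEMENTS = [
    Element("prepare work area", 20, False),
    Element("select correct tool", 20, False),
    Element("follow safety lockout", 20, True),   # absolute requisite
    Element("complete procedure", 30, False),
    Element("clean up", 10, False),
]

# Hypothetical examiner record: everything correct except the safety step.
OBSERVED = {
    "prepare work area": True,
    "select correct tool": True,
    "follow safety lockout": False,
    "complete procedure": True,
    "clean up": True,
}

CUTOFF = 70  # hypothetical passing score
```

With these assumed values the examinee earns 80 of 100 points, so the
total-score rule passes him, while the go/no-go rule fails him on the
missed safety step. Which elements deserve critical status, and what the
cutoff should be, remain the open questions the point raises.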

Areas that might be explored by CAPT in the coming year: 
1. Some refinement and elaboration of the Guidelines for the Evalua- 
tion of Applied Performance Test Materials and Procedures is needed. It 
is hoped that a document would be produced comparable to the APA Stan-
dards for Educational and Psychological Tests--keeping firmly in mind the
differences between performance tests and conventional tests, and the
fact that many performance tests are criterion-referenced rather than 



norm-referenced. Thus, the Guidelines should not be bound by conven- 
tional test practices and standards where these are inappropriate to the 
purposes or format of performance tests. 

2. Preparation of a basic text for the classroom teacher on how to
write--and how not to write--performance tests is desirable. An anno-
tated bibliography is not sufficient, since much of what has been written
to date about performance testing is buried in materials concerning con-
ventional testing. And there is probably no existing how-to-do-it source
to prepare a performance test writer appropriately or efficiently. A
second, more sophisticated and detailed text for the test specialist,
covering preparation, use, and interpretation of results of performance
testing, is also needed.

3. It would be useful to have a section of the CAPT Newsletter or a
comparable medium devoted to exchanging ideas about performance testing;
a publication similar to the UCLA Evaluation Comments might be appro-
priate.

4. Some investigation should be undertaken concerning various legal
aspects of performance testing. Two aspects requiring early attention
come to mind:

The first concerns records of performance. In conventional test-
ing, if there is some question regarding the accuracy of test scoring,
one can return to the answer sheets to verify the scoring; for certi-
fying tests, it is conventional practice to retain answer sheets for
some time in the event that questions concerning accuracy of scoring
arise. For performance tests in which the test is of process, the only
record is the examiner's recording sheet. If his accuracy or objectiv-
ity is questioned, will additional evidence be needed to verify or jus-
tify the score? For certifying examinations this could become a critical
issue. Second, an important aspect of performance is how an individual
operates in crisis situations. However, to subject an examinee to per-
formance tests simulating severe stress may, however realistic, be in-
appropriate and highly unacceptable. To what extent can/should crises
be incorporated into testing? And what options are more acceptable?

5. Some types of performance tests need not be kept secure. For ex-
ample, a checklist for a given job or product in effect provides the ob-
jectives and a learning resource as well as the rating system for the
performance test. However, for problem-solving skills, if the problem
is to be a new one to the examinee, so that he can demonstrate problem-
solving ability, the test must remain secure.

It would be useful, given the expense of developing tests, to have
a mechanism for sharing secure tests while still maintaining security.
The design and implementation of such a system would be a major contri-
bution to the field of performance testing.

Dr. Ruth Nickse
Syracuse University 



I feel if this meeting had been held a year ago I would not have
ventured forth where angels fear to tread. I had never heard of applied
performance testing, and in fact, I didn't know that was what I was doing
until I received the material from CAPT not too long ago. I will tell
you what my charge was and what I did and then you can throw your rocks
and stones because we certainly have, in our effort to do something dif-
ferent in testing, thrown out the baby with the bath water.

When I joined the group in Syracuse my charge was to design an 
assessment system that would provide an opportunity for adults to demon- 
strate what they knew and could do regardless of where they had learned 
it, in certain required areas such as computation, communication and 
life skills; and secondly, to grant a regular high school diploma to 
ratify this learning. What we did was to develop a new kind of testing
program based on some assumptions we had about adult learners. Our pro-
gram is full of assumptions. We feel very comfortable with them right
now, but of course they are very questionable. We assumed that adult
learners were test anxious having come through the American school sys-
tem; that they were rebellious because they had been subjected to GED ex-
ams when they were perfectly fine auto mechanics; and some did not care
about--or get the proper answer about--the amount of wood pulp in the
State of Oregon in 1922. We figured that they were busy with full-time
jobs and families and had little time to sit around testing rooms. We 
figured that they were highly motivated to work for a high school diploma 
after many years out of school. We figured that because they were adults 



they could be responsible for their learning and testing situation. And, 
we figured that they needed an opportunity to choose assessment modes
which would best enable them to present their skills and competencies.
Above all, and this was our biggest assumption--we felt that they were
competent in life skills by virtue of having lived and worked in the 
community. 

If you start with these assumptions, you are free to design what 
we call an open assessment system. But you have to go along with all of 
these assumptions. We decided that we didn't want to create another
GED. Many of the persons we hoped to reach as a target population had
had sad experiences with the GED and other kinds of standardized tests.
Since we were free to dream wild and big, we did. Our objectives were 
to design an assessment system responsive to adult learners, to give 
learners some control over the testing environment, to make the assess- 
ment process a learning experience, to relate assessment form and con- 
tent to the concerns of adults and to make the testing process humane 
insofar as we could do it. 

In order to reach the objective of giving learners control over 
the testing environment, we designed diagnostic and final assessment
instruments and processes that are initiated on demand and are self-paced.
In our system, learners have several assessment options. In order to 
make the assessment process a learning process we have told the learners 
in advance the 64 competencies that they will be required to demonstrate. 
Throughout the program we keep them thoroughly informed of their progress 
in demonstrating the 64 competencies. 

Some of the tasks that we all face as adults are changing residence,
finding a place to live, finding a job, developing consumer awareness
and maintaining personal health. We used simulations and we used oral
interviews as part of our procedures. In order to make the testing pro-
cess humane, we went to individualized testing. We helped the learners
assume responsibility for their own progress because they initiated the
request for testing. We give them continuous feedback and success exper-
iences because we hope to design this on their strengths and not on their
weaknesses. Too often, I think, testing makes it easier for us test de-
signers to work on error; it doesn't make the person being tested feel
so good. 

The distinctive features of our external diploma assessment process
are these. We always talk about the good conditions for testing. Well,
the temperature has to be right, the light has to be right, the distrac-
tion level has to be down, and so on. The best place for that is at
home, so we designed flexibility in time and location of testing by al-
lowing the adults the opportunity to establish some of the conditions of
testing by taking three of the tests that purport to assess the 64 com-
petencies at home. They can take the tests in our office, but if it is
more comfortable at home, they can do it at home. After all, if you are
working on two jobs, the time you have for testing is pretty short. Two
of the tests are oral, because some people do not do well when they have
to write, but they do speak well. If it is a matter of health or re-
lated health competencies for yourself and family it is just as valid,
I think, to discuss these kinds of things as to write them down or to
choose the correct multiple choice answer.

We felt it was important to have open information on the require-
ments so the competencies would be explicit and open to discussion.
Learners are given a copy of these competencies to take home. As a mat-
ter of fact, as one of the diagnostic instruments we have a self-rating
checklist. It is amazing to find out that adults realize what they don't
know and are willing to mark that down as a weak area as long as there
is no penalty attached. 

We have flexibility in scoring. There are some right-answer ques-
tions in math. Of course, you have to be 100 percent right. It's like
in the old days when mother said that you had to eat everything on your
plate. In our system math is one of those things. You must demonstrate
it and you must demonstrate it 100 percent. Some of the answers to our
questions are merely documentation. If you wish to ask the question,
"What is evidence of having participated in the community as a responsi-
ble voter?", one of the things that you might document is whether the
person has a voter's registration card.

We have continuous feedback after each of the take-home tasks.
There is a spotcheck in which some of the most vital competencies are
tested in the office again. We know that the wife could give a little
help to the husband who is studying for his external diploma, so we do
ask them to demonstrate some of the critical competencies back at the
office. They don't object to that, as a matter of fact.

We offer, of course, the first competency-based diplomas in the
country. We have scooped Oregon, which is going to be giving diplomas in
1978 for demonstrating life skills competencies. We feel this is an im-
portant direction for adult education; the implications for secondary
and elementary school curriculum I will leave to those persons who are 
involved in this at the state level. However, I want to draw your atten- 
tion to the work of Dr. Norvelle Northcutt in Austin, Texas, who has part- 
ly answered the question about what adults need to know to function in 
our society. The results of his study, which identifies some 75 compe-
tencies that adults probably need to function successfully, will be out
in December of this year. His national survey of adult competencies will
be shocking to some and not surprising at all to others. But, if you
are concerned about validation of competencies, his work offers a valu-
able resource for you.

In our work we have been confronted with several kinds of problems 
and we request your help on these. I don't think any of them are new.
One of the problems is that we do not have a process for good task analy-
sis. Since competencies are not leveled in our system and since adults
need to read passages on many different levels--from road signs to the
domain of leases--we need a process for good task analysis.

Marilyn Lichtman's reading test is probably the first one that I
have seen that confronts the nine or ten different domains of reading in
which adults must achieve in order to be successful in their daily lives.
It has been the most useful standardized test booklet that we can find.
If you are interested in such a test, you should look it up. It is a
self-paced, self-initiated test which adults, in my experience, are
pleased to take because they find it relevant to their needs and inter-
ests. But, of course, it doesn't break down reading tasks into small
prerequisite skills. We do need a method of breaking down competencies 
into prerequisite skills; that is an enormous job. 

We need some criterion samples. We need a behavior analysis of 
what constitutes good or poor performance. And then we have to ask our- 
selves, "Why are we asking ourselves that?" I think there is a tremen- 
dous amount of value judgment in education, notably in the field of ap- 
plied performance testing. Our 64 competencies were selected by a task 
force of persons who probably came from similar backgrounds and valued 
the same competencies. Whether those competencies are truly representa-
tive of what all people in our adult society need to know, I'm not sure.
We should remember that we make value judgments each time we select a
group of competencies and then decide to legislate them or, by fiat,
label them a curriculum.

We must empirically evaluate what people really do on their jobs 
and in their lives. What one thinks they do and what they actually do may 
be different. Testing should correspond to reality. Some behaviors are
probably common across certain occupational fields, but until we do em- 
pirically validate those things I think we have to hold ourselves in 
check and realize that we are making value judgments all the time. 

We really do need to review competencies at frequent intervals. If 
we are to say that they represent what adults in this society need to
know, I think they might need changing every month--or at least every
year or two. Those of us involved in the explication and testing of
competencies must realize that we are in for the long haul and budget
some money for regular reviewing. I think it is very important that we 
have good behavioral objectives with precise criteria defined. No mat-
ter how we do it we can't afford to skip that step, because as Bill Osborn said,
"out of the objective comes the testing." But the questions behind that
are: Why this particular item? Why that competency? Who values it, and
who sets up the criteria? Those are big questions. 

I won't add any more to my plea except that I see applied perfor-
mance testing as a chance to humanize assessment. Of course, that works
in exactly the opposite way from cost accounting and our concern with
group tests. My concern has been with the adults who take the test. The
beautiful test is a wonderful thing and I value it too, but the persons 
taking the test are equally valuable and their needs as test takers 
should be considered. 

I picked up a quotation relevant to my work from someone who has 
written a little book on how to conduct oral interviews. "We cannot 
humanize assessment without taking risks of abuse. The problem is to
preserve humanity while enhancing validity." If CAPT can tell me how to
do that, then it has been organized for a good reason.

William C. Osborn
Human Resources Research Organization 

As one who has been involved for several years in the development
of performance tests--chiefly in connection with Army training evalua-
tion--I see the problems of performance-based measurement in the field
of educational evaluation to be essentially unchanged during that time.
Most are practical problems encountered in trying to provide what might
be termed efficient tests--that is, tests which are valid and reliable,
but also usable in the sense of evaluating the proficiency of large num-
bers of people at minimum cost in time and resources. Achieving a bal-
ance between test quality and administration economy lies at the heart of
the performance testing problem. 

Although performance tests have other purposes, they are used 
chiefly in evaluating training and educational outcomes. Following 
training on a job- or life-task, a student is normally required to dem- 
onstrate proficiency on that task before being advanced to the next stage 
of learning, or ultimately, out of school and into the world of work.
The development and use of such tests would seem to be straightforward:
the job- or life-relevant conditions for task performance are specified
and an acceptable criterion of performance defined. The student's per-
formance is then evaluated according to the established criterion. Un-
fortunately, the nature of certain job- or life-tasks, together with
time and cost constraints, often creates problems for the test developer.
In circumventing these problems he may resort to simplistic test proce-
dures of questionable reliability or validity. The seriousness of this
problem is reflected in the fact that such compromises very frequently
occur--apparently either because of inadequate regard for the price one
pays in diminishing reliability and validity, or because developers are
not aware of alternate approaches.

This evening I would like to summarize briefly four aspects of
performance test development that I consider essential to the practical
achievement of reliable and valid measures. Please bear in mind that my
observations will be limited to test development for individual tasks
and will not touch on other aspects of reliability and validity--such as
sampling of the job task domain or replications of test performance--
which pertain to testing on an aggregate of tasks or an entire job.

Test Method

The first critical aspect of a performance test to be considered 
pertains to the directness or relevance of what I will call the method 
of testing. A test method is relevant or direct if it requires perfor- 
mance identical to that specified in the actual job- or life-task. The 
scope and fidelity of actual job or life conditions presented and the
realism of the response medium used determine the directness of the test-
ing method.

In a training or other performance assessment setting, limited
resources often prevent a direct task enactment method of testing. In-
direct methods, involving partial task performance or simulation of task
conditions, are often used. Such methods commonly measure performance
only on the more testable part of the task. Paper-and-pencil knowledge
tests on tasks requiring both knowledge and skill represent the most
flagrant example of indirect testing. Tests of job knowledge are rela-
tively inexpensive and have exceptional psychometric properties. Yet,
for obvious reasons, we would never consider licensing a man to fly a



plane or drive a car merely on the basis of a knowledge test. But why 
then, in other job or job task areas, do we tend to accept knowledge as 
a valid measure of performance capability? The chief reason is cost. A
performance test presents the real work environment with all its cues,
then elicits actual job behavior, as directly as possible. But represen-
tation of the real world is expensive. Educational and personnel admin-
istrators tend to think performance tests require too much in the way of
equipment, personnel and time to justify their use. To insist, however,
that a test of job knowledge is the only alternative reflects a false
dilemma.

For any given job task several alternative testing methods are 
available. These will run the gamut from an expensive but fully relevant 
performance test to a relatively inexpensive but marginally valid knowl-
edge test. Elsewhere, I have described an approach to devising alter-
nate test methods, based on the concepts of simulation and task-element
sampling. I have collectively termed such measures Synthetic Perfor-
mance Tests. The intention is to connote a process of synthesis by
which the substructure of a job task becomes the basis for selectively
constructing alternative forms of a test, each representing (at least
theoretically) a more or less optimal blend of validity and feasibility.
In some cases this optimal blend may be achieved through simulation;
that is, by substituting stimuli in either the task display or the sur-
round, or by requiring a substitute response. In other cases, perfor-
mance may be efficiently measured by testing on a subset of task ele-
ments, regardless of whether simulation is used. Thus, synthetically
generated alternatives to fully relevant performance tests may vary in 
two major dimensions: fidelity and scope. 

Consider, for example, an electronic troubleshooting task. Know-
ing the correct test sequence for isolating a faulty equipment component
is only part of the task. Among other task elements the troubleshooter
must also be able to place the test-set in operation, establish a good
connection at the test points, and correctly interpret the test readouts.
Can this type of job task be adequately--that is, validly--tested with a
traditional verbally formatted test of job knowledge? I would say no.
In fact, experience may reveal that, on the job, a frequent cause of
faulty troubleshooting is the inability of the troubleshooter to estab-
lish good connections at the test points--an essentially physical or
manipulative element in the task performance. So, assuming the test de-
veloper cannot afford the luxury of a direct, hands-on method of testing,
the important thing is that he does not immediately revert to the typical
knowledge test. He should use his inventiveness in devising alternative
testing methods that call for demonstrated behavior as similar as pos-
sible to that required in task performance. Pictorial, graphic, or even
low cost three dimensional simulators should be considered. The develop-
er may assess the relevance of these synthetic options by checking the
breadth and criticality of task elements measured by a particular method.

Only in this way, it seems to me, can test developers arrive at ec- 
onomical methods of proficiency testing while maintaining an acceptable 
level of content validity. 
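
One way to picture the check on breadth and criticality is the short
Python sketch below. It is only an illustration under stated assumptions:
the four task elements are taken from the troubleshooting example, but
the criticality ratings (1 to 3), the list of test methods, and the
elements each method is said to measure are hypothetical. Breadth is
taken here as the share of task elements a method measures; weighted
coverage is the share of total criticality those elements carry.

```python
def coverage(task_elements, measured):
    """Return (breadth, criticality-weighted coverage) for one test method.

    task_elements: dict mapping element name -> criticality rating
    measured:      set of element names the method actually measures
    """
    hit = [name for name in task_elements if name in measured]
    breadth = len(hit) / len(task_elements)
    weighted = sum(task_elements[n] for n in hit) / sum(task_elements.values())
    return breadth, weighted


# Task elements from the troubleshooting example; ratings are hypothetical.
TASK = {
    "choose test sequence": 3,
    "operate test set": 2,
    "connect at test points": 3,
    "interpret readouts": 2,
}

# Hypothetical assignment of elements to alternative test methods.
METHODS = {
    "knowledge test": {"choose test sequence"},
    "pictorial simulation": {"choose test sequence", "interpret readouts"},
    "low-cost 3-D simulator": {
        "choose test sequence",
        "operate test set",
        "connect at test points",
    },
    "hands-on test": set(TASK),
}


def rank_methods(task_elements, methods):
    """List methods from highest to lowest criticality-weighted coverage."""
    scored = [(name, *coverage(task_elements, elems))
              for name, elems in methods.items()]
    return sorted(scored, key=lambda row: row[2], reverse=True)
```

Under these assumed ratings the knowledge test measures one element in
four and 30 percent of the weighted criticality, while the low-cost
simulator reaches 80 percent. The ratings themselves would have to come
from an analysis of the actual job task.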

Test Criterion 

Let me turn now to a second dimension of performance tests— that 
of test criterion. All tasks have both a product (outcome) and process 
(steps in task performance). Product measurement is, however, of over-
riding importance in certifying performance on a task; failure to include
product measurement as the principal criterion may severely limit test
validity. Although it may safely be said that every task has a purpose,
in practice a great many performance tests employ process measurement
only in evaluating a person's readiness to perform outside the class-
room.

Before looking more closely at why process measures are widely
substituted for measures of task product, we must consider three types
of tasks. First, there are tasks in which the product and the process
are the same--that is, the product is a process. These tasks are few,
and normally serve an aesthetic purpose; examples include springboard
diving, dancing, playing a musical composition. Here we see that the
product of the task is more or less the correct execution of steps in
task performance--that is, the process. Second, there are tasks in which
the product necessarily follows from the process. Fixed procedure tasks
typically fall in this category. Troubleshooting an electrical circuit, 
balancing a checkbook, and changing a tire are examples. In such tasks 
the procedural steps are known and observable, and comprise the necessary 
and sufficient conditions for task outcome; if the process is correctly 
executed, task product necessarily follows. 

For these first two types of tasks it is not particularly import- 
ant whether process or product measurement is used. But for a third 
type, it is very important. This is the type in which the product is not 
fully predictable from the process--either because we cannot specify all 
the necessary and sufficient steps in task performance, or because we 
cannot or do not accurately measure them. In spite of the obvious im- 
portance of product measurement for tasks in this latter category, in 
practice performance tests often do not focus on product. And the rea- 
sons generally stem from practical considerations in which the measurement 
of task product is viewed as too costly, too dangerous, or too impracti-
cal. For example, in a first aid task involving controlling the bleed-
ing from an external wound, the test developer would probably be limited
to requiring demonstration of task process; observation of the actual
task product--restriction of blood flow--would probably not be possible,
for obvious reasons. Other situations are less obvious, however. If
any of you are involved in instructor training, you may have observed
that a student instructor is evaluated on the basis of such process fac-
tors as "had a well organized lesson plan," "used visual aids effec-
tively," "had good eye contact," "had good voice projection," "covered
all points in the lesson plan," and so on. Although the product of in-
struction is clearly student learning, it is seldom if ever used as the
criterion for qualifying an instructor--probably because it would involve
a more time consuming method of evaluation.

I'm sure we could all testify to other instances in which product
measurement is not used. Some instances are justified by cost or safety
considerations; others are not. It seems to me that test developers of-
ten fail to see its importance when faced with practical limitations.
The overriding question that a test developer should ask himself in this
situation is, "If I use only a process measure to test a person's
achievement on a task, how accurately can I predict on the basis of this
process score whether the person would also be able to effect the pro-
duct or outcome of the task?" Where the degree of accuracy is substan-
tially less than that to be expected from normal measurement error, the
test designer should pause and reconsider how time and resource limita-
tions might be compromised to achieve at least an approximation of product
measurement.
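
The question of how accurately a process score predicts the product can
be examined on a small tryout sample in which both are observed. The
Python sketch below is an illustration only: the ten paired records are
invented, and the two summary figures are one simple way of tabulating
agreement, not a procedure prescribed in this paper. No threshold for
"substantially less" accuracy is assumed.

```python
def prediction_accuracy(records):
    """Share of examinees whose process result matches the product result.

    records: list of (process_passed, product_achieved) boolean pairs.
    """
    if not records:
        raise ValueError("at least one record is required")
    agree = sum(1 for process, product in records if process == product)
    return agree / len(records)


def false_pass_rate(records):
    """Share of process passes that did NOT achieve the task product."""
    products_of_passes = [product for process, product in records if process]
    if not products_of_passes:
        return 0.0
    return products_of_passes.count(False) / len(products_of_passes)


# Hypothetical tryout data: (passed process checklist, achieved product).
RECORDS = (
    [(True, True)] * 6      # process and product both satisfactory
    + [(True, False)] * 2   # correct-looking process, product not achieved
    + [(False, False)] * 1  # both unsatisfactory
    + [(False, True)] * 1   # product achieved despite process errors
)
```

For these invented records the process score agrees with the product in
7 of 10 cases, and 2 of the 8 examinees who passed on process did not
achieve the product. Whether figures like these are acceptable depends
on the cost of error in the task concerned.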

Test Conditions

Now, let's look at a third dimension of performance tests: stan-
dardization of conditions under which a test is administered. This is
an important step in achieving test reliability. Indeed, standardized
conditions constitute the very essence of any proficiency measure which
professes to be a test. Because this requirement is familiar to test
developers, it is seldom violated. Most developers make an effort to
maintain test instructions, materials, tools and other environmental
factors as nearly constant as possible from one test administration to
the next. However, I would like to call your attention to one particu-
lar class of tasks which is particularly troublesome in this regard:
tasks involving interpersonal behavior. In such situations, a person or
group of persons represents an important part of the environment to be
controlled, or standardized from one test administration to the next.
Sample situations include counseling, salesmanship, personnel management,
or something like hand-to-hand combat. People are part of the task rele-
vant conditions in each of these areas, and obviously people are diffi-
cult to standardize. If you wanted to assess a policeman's ability to
properly subdue an unarmed but hostile suspect, what would your perfor-
mance test be like? How would you insure that test conditions were
standardized over all policemen to be tested? The same questions might
be asked about assessing a supervisor's ability to persuade a worker to
perform some difficult or unpleasant task.

Unfortunately, I know of no easy solution to this problem. Test 
designers should consider greater use of the well trained, "standardized 
other." And, here, greater effort should be made to avoid settling too
quickly for some probably irrelevant measure of task process. 
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Test Scoring 

The fourth aspect of performance tests I wish to address is test 
scoring. Scoring protocols primarily affect reliability, but if grossly 
mishandled in test design— as I will point out in a moment— they may also 
jeopardize test validity. Scoring procedures involve translating an ob- 
served test outcome into an objective pass-fail score. Such procedures 
should be structured so that only the more reliable perceptual skills 
are used; that is, the scoring activity should be reduced to one of 
matching or comparing the test response with some model of correct re- 
sponse. Unfortunately, in many test situations responses seemingly can- 
not be judged in this "either-or" fashion, but require a "more-or-less" 
type of judgment. When this occurs the test developer should not (as is 
sometimes done) compromise by using a test method that yields a more 
measurable outcome because test validity may suffer. Rather, he should 
strive to break the task-relevant response down into elements, so that
a scorer can more easily make comparative judgments. Typical programs 
of knowledge testing provide a familiar illustration. The pervasive 
multiple-choice test yields responses which can be scored with maximum 
reliability. Scorers obviously have little difficulty in matching a 
selected response alternative with that which is keyed as correct by the 
test developer. The scoring of essay tests, on the other hand, has
traditionally presented reliability problems. Yet despite the scoring
problems inherent in essay testing, a competent test developer would not
resort to multiple-choice testing on knowledge tasks demanding recall or
generation of material merely to achieve greater scorer reliability.
Normally, he would provide a model response in the form of an exhaustive
list of the critical elements of an acceptable essay response. The
presence of such elements could then be judged with relative objectivity
by a qualified and earnest scorer.
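
The element-by-element scoring just described might be sketched as follows. This is only an illustration under stated assumptions: the element names, the function score_essay, and the required_fraction cutoff are hypothetical and do not come from the paper. The scorer makes one either-or judgment per critical element, and the program does nothing more than tally those judgments into a pass-fail result.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ElementJudgment:
    element: str   # one critical element from the model response (hypothetical)
    present: bool  # the scorer's either-or judgment for that element


def score_essay(judgments: List[ElementJudgment],
                required_fraction: float = 1.0) -> Tuple[float, bool]:
    """Tally either-or judgments into (fraction present, pass/fail).

    required_fraction is an assumed cutoff; the paper does not specify one.
    """
    if not judgments:
        raise ValueError("the model response must list at least one element")
    found = sum(1 for j in judgments if j.present)
    fraction = found / len(judgments)
    return fraction, fraction >= required_fraction


# Hypothetical judgments recorded by one scorer for one essay.
example = [
    ElementJudgment("element A", True),
    ElementJudgment("element B", True),
    ElementJudgment("element C", False),
    ElementJudgment("element D", True),
]
fraction, passed = score_essay(example)
print(f"{fraction:.2f} of critical elements present; passed={passed}")
```

The design choice worth noting is that the only human judgment left is a matching judgment (is the element there or not), which is the kind of judgment the text argues can be made reliably.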

This same thinking applies to the development of scoring protocols
for performance tests if these tests are to produce reliable results.
The subjectivity with which many task performances are customarily
scored could be substantially reduced, it seems to me, through wider use
of what may be termed scoring templates. Where the model response on a
test of marksmanship is defined as a hole in the bullseye, it is
relatively easy for the scorer to judge the acceptability of the response
made by the rifleman. The concentric circles normally marked on a
target act as a kind of simple template which enhances the ease and
objectivity of scorer judgments. Templates could be applied equally well in
scoring other tests. For example, tasks mentioned earlier in which the
outcome is a process are often difficult to assess reliably. It would
appear that performances such as springboard diving or gymnastic
exercises could be more objectively scored if the outcomes were filmed and
figural templates overlaid on key frames to assess the performer's
accuracy at those critical points. Similarly, in evaluating the
performance of a music student, recordings of selected renditions could be
analyzed at the scorer's leisure--perhaps with the aid of auditory
"templates" such as a metronome to measure beat or comparative tones to
assess tonal quality. For these particular tasks--or for that matter, any
task in which the product is transient--the added cost in recording the
product for later scoring would probably be offset by savings in scoring
costs; that is, the more objective approach to scoring would very likely
preclude the usual requirements for a panel of expert evaluators. But
more important, the scorer would not be constrained by real time, and
could function at a place and time and pace of his or her choosing,
using prepared templates to increase objectivity.

These four factors--directness of test method, type of performance
criterion, standardization of conditions, and objectivity of scoring--
must be the focus of further research and creative development work if
performance tests are to be used validly and reliably.

Prefatory Note



These papers were presented at the First National 
Annual Training in Business and Industry Conference of 
New York University, held in New York City in March
1972. The first paper, "'If It Exists, It Can Be Measured'--
But How?" was prepared by Dr. Eugene A. Cogan, who is
Director for Research Design and Reporting in the
Executive Office of the Human Resources Research
Organization (HumRRO) in Alexandria, Virginia. The
second paper, "Measuring Effectiveness: Quality Control of
Training," was prepared by Dr. J. Daniel Lyons, who is
Director of HumRRO Division No. 1 (System Operations),
also located in Alexandria, Virginia.

"IF IT EXISTS, IT CAN BE MEASURED"--BUT HOW?

Eugene A. Cogan

Psychologists--including those especially interested in measurement--have been, and
continue to be, plagued by elusive and fragile concepts. Many concepts have their origin
in the individual and cultural experiences all people share. For example, we all have the
feeling that we know some people who seem "smarter" than others over and above
difference in their schooling or other educational experience; this feeling has led to the
concept of "intelligence" and to attempts to define, understand, and measure intelligence.
Our shared experiences have led us to feel that some people are better employees than
others; this feeling has led to attempts to define, understand, and measure "goodness as
an employee." Attempts to cope with "goodness as an employee" have been equally as
frustrating to employers and to psychologists as have been attempts to make sense of
"what is intelligence all about."

The main stem of the title of my paper--"If it exists, it can be measured"--is a free
translation of a classic statement by Edward Thorndike, who was trying to counter the
pursuit of poorly defined pseudo-concepts that bordered on being personal illusions. For
us, Thorndike's message is: "Until you can define what you are interested in well enough
so that you can figure out how it can be measured, it can mean anything and, therefore,
it means nothing."

The challenge of Thorndike's proposition to theoretical psychology has no easy
answer because theoretical psychology is concerned with generally important abstractions
regarding human behavior. There is an understandable reluctance to fix on formal
definitions for concepts because useful definitions must be restrictive and omit things;
theoretical psychologists are reluctant to risk throwing out a baby with the bath water.

However, for practical, applied measurement the implications of Thorndike's
doctrine are very useful. In a practical setting, Thorndike's edict translates to: "Of course
you can measure it, after you have defined what it is." The main purpose of my
presentation will be to deal with how to go about defining "it" so that you can proceed
to measurement, and then how to evaluate the measurement.

In any practical setting, there are many situation-specific features and these provide
a key to measurement. The trick to translating an impression into a measurable
something consists of using the situation to define what measurement is needed.

Purpose of Measurement

Foremost for defining measurement is "why." In selecting or devising a measurement,
it is essential to decide or determine the purpose of the measurement. In industry,
the purpose translates to decisions that management or personnel people must make. Who
will decide what with the aid of measurement information?

It is not enough to stop analysis of purpose at the broad levels of selection,
assignment, promotion, training evaluation, or personnel evaluation. Each of these
includes so many variants that depend on particular purposes that the category is the
beginning, not the end, of analysis. If concern is with selection, the proper measurement
depends on whether selection is for training or for direct job assignment, whether
concern is solely for competence in an entry job or also with potential for advancement,
whether the work setting is closely supervised or relies on self-supervision, whether the
work setting requires team work or individual work, and so on.

Even what seems to be a specialized and highly specific purpose like quality control
of training, as is shown in Dr. Lyons' paper, involves at least four distinctive purposes 
and each of these has its own distinct definition and measurement. 

What is Measured 

In a particular setting, with purpose established in terms of the particular decisions
that are to be made, the second element in defining the measurement concerns what is to
be measured. Much of the definition of "what" will already have been established in
careful definition of purpose. That is, if the purpose concerns selection for a training
program preceding assignment to a job, the "what" should not contain very many, if any,
direct indications of job knowledges and skills, but rather should deal with ability to
learn these knowledges and skills. On the other hand, if selection is for direct assignment
to job duties, it is whether these have been previously learned that is pertinent.

The matter of what is to be measured has been, by far, the subject of most concern
and debate in industry and among measurement specialists. Primarily, this is because
dollars and time for measurement, cost elements that are very sensitive in industry, are
heavily dependent on what is measured. For example, considering job performance
evaluation, the best theoretical measure is unobtrusive, scientific observation and careful
measurement of behavior over a long period of time, in the actual job setting. While such
measurement is technologically possible, it would be so prohibitively expensive that less
costly alternatives are always being sought and, typically, used. However, these less costly
methods do not measure the same thing!

Usually considered closest to scientific observation in the natural setting is a job
sample test. Even assuming that sampling of the job performances is well done, job
sample simulation is not the same as job observation because important contextual and
personal elements cannot be simulated. That is, a test environment creates test
performance for the individual. He may try much harder than he does in the natural setting, or
he may be immobilized by test anxiety.

Less costly--and hence more common than job sample simulation tests--are analytic
tests of job performance elements. Such tests measure component skills and knowledges
underlying job performance. We are all familiar with such analytic tests as they apply to
selecting a secretary. For a candidate secretary, one might use a typing test, a dictation
test, and a spelling test. While such tests can provide assurance that necessary individual
job skills are within the candidate's repertoire, they do not assure the person can fit the
skills and knowledges together effectively in a job setting, or that the person can or will
do the many other tasks required on the job.

For still less cost than analytic tests, there are indirect tests of capabilities, usually
paper-and-pencil tests dealing with incidental information about the job.

The simplest of the indirect tests are specialized vocabulary tests. For example, a
good secretary is likely to know what "platen" means, and what a number four pencil is,
and what the term "stay-back file" means. Since none of these three items of information
is intrinsically of consequence in doing a good job as a secretary, they constitute
indirect measures.

Use of indirect measures must be approached with great caution and checked
empirically against more direct measures. This is because possessing such information may
not come from job competence--witness the fact that I know the meaning of the three
terms, but I have no secretarial competence whatsoever.

Most common of all as a measurement of job performance in industry is the rating
scale. The reasons are that, first, it is the least expensive measure and, second, it seems to
make sense to go to the day-to-day observer of job performance who has "seen job
performance with his own eyes over a long period of time." Despite the sensibility and
low cost of rating scales, they don't do what most people think they do. Rating
scales--regardless of what the rater is asked to check--provide a measure of an overall
"Joe is OK by me," rather than how well Joe can perform elements in his job. I do not
at all intend to deprecate the value of personnel decisions based on "Joe is OK by me"; I
wish, however, to emphasize that what is being measured in that fashion differs from
what is measured by a performance test even if the terms used are similar.

There are differences in what is being measured for all the categories named: natural
observation, job sample tests, analytic tests, indirect tests, and rating scales. Treating
them as alternate techniques to measure the same thing can be severely misleading. It is
traditional to consider these measurements as alternatives, differing in technique but not
in what is being measured. This inaccurate assumption of equivalence is possible only
because not enough--and not precise enough--analyses have been performed to define
purposes of measurement and what is to be measured.

Effectiveness of Measurement

I will now turn to effectiveness or--in psychometric terms--validity, as it applies to
the consideration of measurement.

I began this paper with the proposition that one first must define carefully and
analytically the precise purpose of measurement, taking into account the organization
setting; then I pointed out that purpose translates to who will make what decision using
the measurements. Second, I proposed that purpose and decision should be the key
ingredients in determining what will be measured, but I only touched on how one goes
about translating purpose into what is measured. I skirted the transition because only
gross and tentative rules or guidelines are available. Basically, the measurement specialist
must--as a first cut--use his best judgment. Since his best judgment may be wrong or may
be severely distorted by cost or other practical considerations, it is essential that the
development of a testing program be viewed as a cyclic feedback process, or a cut-and-fit
process, with a continual flow of information on whether decisions using test data are
good ones. Information on the flaws in such decisions provides the means for changing
the measurement and--over time--shaping measurement to maximum support of the
decisions that need to be made.

The term "validity" in psychology has many meanings--and the meaning varies
depending on the person and on the context in which the term is used. For this reason, I
shall avoid these ambiguities and discuss more broadly what one should consider in
dealing with the effectiveness of measurement.

The first question to consider is the accuracy of the measurement. What are the
tolerances of the emerging numbers?

It is tempting to propose "the more accurate the better." But that proposal is
untenable because cost of measurement increases as requirements for precision increase,
in the same way as measurement to one ten-thousandth of an inch is more expensive
than measurement to the nearest foot. Just as we decide on tolerances for a length
measurement by considering our purpose--whether it is watch-making or road-building--
the precision needed in psychological measurement depends on the purpose of measurement,
that is, the nature of the decision that is to be made.

The second question regarding effectiveness of measurement concerns stability. If
one retested at some later time, how similar would the measurement numbers be to a
first set of numbers? Psychologists normally call this characteristic "reliability" but, as
with the term "validity," "reliability" has multiple meanings and use of the term is more
likely to confuse than to clarify.

How much stability is needed? The hoary tradition of psychological measurement
includes the rule that a "correlation of .8 or more is needed for individual decisions; a
correlation as low as .3 can be used for group decisions." This serves as a general rule of
thumb and, therefore, cannot fit anything. Much better than the all-fitting and hence
never-fitting rule is the analysis of purpose and what is to be measured. From analysis of
the purpose, one can define the kind of stability of measurement that is needed. From
analysis and interpretation of what is being measured, one can distinguish between
stability of measurement as it pertains to mechanics of measurement and as it pertains to
the nature of what is being measured. In some instances, stability over time would be
nonsense. For example, suppose we administer a typing proficiency test to a group about
to begin training in typing. Wouldn't it be foolish to expect test scores secured after
training to be about the same as the first set?
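
To make the stability question concrete, here is a minimal sketch assuming that stability is summarized by the ordinary product-moment (Pearson) correlation between a first and a second administration, which is the kind of figure the .8 and .3 rule of thumb refers to. The function names and the scores are hypothetical. The second function reports the average change in score, which is the quantity one would expect to move in the typing example even when the ordering of trainees stays much the same.

```python
import math
from typing import Sequence


def pearson_r(first: Sequence[float], second: Sequence[float]) -> float:
    """Product-moment correlation between two administrations of a test."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / math.sqrt(var_a * var_b)


def mean_change(first: Sequence[float], second: Sequence[float]) -> float:
    """Average gain (or loss) from the first administration to the second."""
    return sum(second) / len(second) - sum(first) / len(first)


# Hypothetical typing scores (words per minute) before and after training.
before = [20.0, 25.0, 30.0, 35.0]
after = [45.0, 52.0, 58.0, 66.0]
print(round(pearson_r(before, after), 3), mean_change(before, after))
```

A high correlation alongside a large mean change would suggest that the mechanics of measurement are steady while the thing measured has changed, which is the distinction drawn in the paragraph above.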

The third question under the heading of effectiveness is the pay-off. How much
better, in practice, are the individual decisions reached using the measurement than those
reached without such information? This question can readily be cast into terms very
familiar in industry: How much would it cost to save how many dollars? What is the net
gain? However, in order to do such an analysis, it is absolutely necessary--to revert to my
main thesis--that the purpose of measurement be analyzed and defined very explicitly,
down to exactly who will make what decisions using the measurement data. With
decisions defined, it is possible--and, perhaps, even routine--to perform a
cost-effectiveness analysis of psychological measurement.

Measurement in industry has enjoyed only mixed success at best, and the question
"Is testing worth it?" addressed to management most often results in the answer "I don't
know." I think there are two related reasons for this unclear state of affairs.

First, there are many industrial managers who enter internal, deliberative policy
councils with a personal conviction that what is really important cannot be measured by
tests and that tests and psychologists are not to be taken seriously. In that same council,
frequently, will be a testing enthusiast and, after a period of wrangling, the traditional
compromise will occur: "Let's try it out on a small scale." Unfortunately, the small-scale
approach frequently leads to skipping the crucial steps of analyses to establish purposes
to the level of who will make what decisions with the information. Therefore, any hope
of getting a good fix on exactly what is to be measured is sacrificed. Usually, a
conveniently available test with a name that seems about right and that may have been
recommended as a good test is chosen for trial purposes--whether or not it fits the
situation and purpose.

Second, exacerbating the instant magic of choosing a convenient test is the fact that,
rather than programing a systematic cut-and-fit program for choosing and/or developing
measures, a one-shot tryout is undertaken. If the test passes, it's in; if not, testing is out
for the company.

Good testing is more expensive than poor testing or no testing. Analysis to
determine whether good testing is worth the trouble is not very difficult, once analysis
and definition have proceeded to the level of who will make what decision with the
information. The costs of poorer decisions in excessive training costs, reduced
productivity, or costs of firing someone and hiring a replacement can be estimated, at least
roughly. In addition, costs of developing and using a measurement system can also be
estimated, at least roughly. From such data, one can calculate a break-even point in terms
of the amount of improvement in decisions that is needed to recover costs of measurement.
Usually, since training, selection, hiring, firing, and other consequences of decisions
are so very expensive, it will be found that even minuscule improvement in the quality of
decisions will more than pay for a good measurement program.
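
The break-even reasoning can be sketched in a few lines. This is a rough illustration, not a method given in the paper: it assumes that "improvement in decisions" is expressed as a reduction in the fraction of poor decisions, that every poor decision costs about the same amount, and that all dollar figures below are invented for the example.

```python
def break_even_improvement(program_cost: float,
                           decisions: int,
                           cost_per_poor_decision: float) -> float:
    """Reduction in the fraction of poor decisions needed to recover costs."""
    if decisions <= 0 or cost_per_poor_decision <= 0:
        raise ValueError("decisions and cost per poor decision must be positive")
    return program_cost / (decisions * cost_per_poor_decision)


def net_gain(program_cost: float,
             decisions: int,
             cost_per_poor_decision: float,
             improvement: float) -> float:
    """Dollars saved by avoided poor decisions, less the measurement cost."""
    return decisions * cost_per_poor_decision * improvement - program_cost


# Hypothetical figures: a $20,000 program, 500 hiring decisions a year,
# and $4,000 lost on each poor decision.
needed = break_even_improvement(20_000, 500, 4_000)
print(f"break-even: {needed:.1%} fewer poor decisions")
print(f"net gain at 5 points of improvement: ${net_gain(20_000, 500, 4_000, 0.05):,.0f}")
```

With figures of this kind, a reduction of one percentage point in poor decisions recovers the cost of the program, which is the sense in which a very small improvement can pay for measurement.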

In closing, I should like to repeat my main points:

First, philosophical disputes about whether a person's characteristics can be
measured are pointless. Anything that can be specifically defined can be measured. Such
definitions should be in terms of behaviors that can be observed.

Second, to develop a testing program that is useful and cost-effective, the
planned use of the test information must be carefully defined: that is, who will make
what decisions using the measurements to be obtained.

Third, analysis and interpretation of the particular purpose and the particular
industrial setting are essential to decide, hypothesize, estimate, or guess what should be
measured. What are usually considered to be different measuring techniques for the same
thing are, in fact, measures of different things.

Fourth, the effectiveness of measurement should be evaluated in terms of
precision, stability, and amount of improvement in organizational activities, all of these
considered in terms of the decisions for which measurement provides support. For
maximum return on the testing dollar, it is essential to proceed cyclically, continually
improving the measurement program in the light of feedback on how decisions are
improved--or not improved--by measurement data.

Fifth, analyses of savings that can be accomplished by improved decisions are
usually startling, producing dollar gains far beyond the cost of developing and employing
measurement.

My main thesis has been that measurement must be considered in the particular
framework in which it is to be used--and here I am talking about measurement in
general! I, therefore, call your attention to Dr. Lyons' presentation on quality control, an
excellent illustration of the concept of defining who will make what decision using what
measurement information.



MEASURING EFFECTIVENESS: QUALITY CONTROL OF TRAINING 



J. Daniel Lyons 



As the philosopher Seneca observed, "When a man does not know what harbor he is
making for, no wind is the right wind." And when training goals have not been precisely
defined in terms of measurable on-the-job performance, no training technique is the right
training technique. The most pervasive weakness of training programs is lack of precision
in locating the harbor of improved job performance. As a result, they are buffeted
constantly by the winds of promise and innovation in training--but no wind is the right
wind.

Development of new training programs and the introduction of changes in existing
programs are fruitless exercises unless and until the means for assessing progress toward
precisely defined goals have been developed. Behavioral psychologists have been portrayed
by some critics as "drab purveyors of the obvious." In this paper, I may well be adding
credence to that observation. It is obvious, is it not, that one does not introduce change
unless there exist mechanisms for assessing the effect of the change? I am in the role of a
drab purveyor of that obvious and fundamental principle. Because in government,
industry, the public schools, and wherever training and educational programs exist, that
obvious principle is being continually violated--at a fantastic cost in wasted dollars and
human potential.

The process of developing the raw material of human potential deserves a system of
quality control at least as carefully developed as that applied to the manufacturing
process. By a quality control system I mean essentially an information system and a
system of concepts, models, and procedures designed to accomplish four main objectives:

(1) Quality assurance

(2) Control of student progress

(3) Training program improvement

(4) Training system diagnosis and change

The quality assurance function is illustrated in Figure 1.

"Does the product meet the specifications?" This question cannot legitimately be
posed unless and until the specifications have been delineated in terms of operational
requirements and these requirements have been reflected in end-of-course proficiency
measures. The intent is to rid the training system of criteria based on amount of training
in favor of demonstrated proficiency in the required job elements. Systematic application
of precise job performance criteria through a quality control system results not only in
an improved product, but also in the discarding of irrelevant material. Thus, the cost of
installing an effective quality control program is amortized through savings in the training
program, particularly in personnel time of instructors and students.

The second objective of a quality control system is to provide a means of selecting 
and organizing the learning experiences of the students to facilitate achievement of the
objectives.

The training program depicted in Figure 2 is composed of a series of segments or
modules (upper half, Figure 2). Conceptually, these may be as long as a major phase of
the course, or as short as a single brief lesson. Each such segment or module is designed
to help the student meet specified learning objectives.

The decision options (lower half, Figure 2) include those of sending the student
forward to the next segment of the course, recycling, or giving special corrective training.
Generating information to aid in choosing among the options is a function of a quality
control system. It should be noted that the option of special corrective training is
contingent upon the precision of the diagnostic instrument; that is, the evaluation
procedure must be capable of identifying specific weaknesses toward which the corrective
training can be directed. The goal is a system by which the trainee is continuously
evaluated, selectively corrected, and advanced as performance standards are met, and only
as they are met.
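
The three decision options can be written out as a small rule. The sketch below is illustrative only: the names Decision and decide, the objective labels, and the exact rule for choosing between recycling and corrective training are assumptions, since the paper states only that corrective training depends on a diagnostic precise enough to identify specific weaknesses.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Decision(Enum):
    ADVANCE = "send forward to the next segment"
    RECYCLE = "repeat the segment"
    CORRECTIVE = "give special corrective training"


def decide(results: Dict[str, bool],
           diagnostic_is_specific: bool) -> Tuple[Decision, List[str]]:
    """Choose an option from per-objective results for one segment.

    results maps each learning objective to whether its standard was met.
    diagnostic_is_specific says whether the evaluation can pin failures to
    particular objectives (an assumed flag, not a term from the paper).
    """
    unmet = sorted(obj for obj, met in results.items() if not met)
    if not unmet:
        # Advanced as standards are met, and only as they are met.
        return Decision.ADVANCE, []
    if diagnostic_is_specific:
        # Specific weaknesses are known, so training can be directed at them.
        return Decision.CORRECTIVE, unmet
    # Without a precise diagnosis, the whole segment is repeated.
    return Decision.RECYCLE, []


# Hypothetical results for one trainee on one segment.
outcome = decide({"objective 1": True, "objective 2": False}, True)
print(outcome[0].name, outcome[1])
```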

The first two objectives, quality assurance and control of student progress, are
concerned with assessment of student performance. The third objective, shown in Figure
3, is that of program improvement; the emphasis is on program assessment rather than
assessment of the individual trainee. Unfortunately, too often changes in training
programs tend to be based on administrative edict. We are all familiar with those frustrating
situations in which changes in management bring about changes to conform to the biases
of the new manager; for example, the shifting emphases on theory and practice in the
training of repairmen depending upon the views of upper management rather than
requirements and performance. A systematic quality control process that can identify
weaknesses and strengths in the program by assessing and diagnosing the performance of
the trainee provides a bulwark against the shifting winds of administrative edict. Further,
the control process is necessary in order to assess the effects of changes made to
strengthen the program. The most important motivator that can be supplied to any
trainer is precise and accurate feedback on the results of his efforts. If this is supplied,
training will improve, if only by trial and error.

From a Training Director's point of view, Figure 4 may be viewed in the following
manner. From the operational elements of the organization, the training system receives
performance requirements that are ridiculously inflated or impossibly vague, which must
be met with trainees and instructors of minimal aptitude and experience supplied by the
Personnel Department, while operating under policies and procedures that are unrealistic,
or inflexible, or antiquated, or obscure, or all of these, while utilizing outdated
equipment and facilities, and operating on a minuscule budget.

While that may be the world as seen by those of us concerned with training, it is
safe to assume that each of the other elements of this system and management see
somewhat different worlds. An adequate quality control system can alleviate the resulting
stresses and strains by providing the information that helps to identify and define the
problems and to evaluate the effects of attempted solutions.

The training system is all too often the scapegoat for problems resulting from poorly
defined operational requirements, inappropriate utilization of training system products,
inadequate personnel selection procedures, and ill-conceived personnel policies. A
well-designed quality control system can serve to put our training house in order and provide
the basic information for productive interaction with other systems in the organization.
In short, it may get the monkey off our backs or fasten it there more firmly, if
appropriate.

The essential elements of a quality control system are:

(1) Training objectives (performance requirements)

(2) Proficiency and diagnostic measures

(3) Data reduction and analysis

(4) Procedures for decision and corrective action

(5) Communication procedures

(6) Managerial support

For quality control, crucial information derives directly from training objectives.
They form the keystone for a useful and effective quality control system by providing
not only the specifications for instruction, but also the basis for evaluating instruction.
Thus, we must begin with a complete set of good training objectives for a training
program, and these objectives represent the mission of the training system.

Management plays the beginning role with regard to training objectives by defining
exactly what is to be accomplished by the training system. The raw material for such
defining comes from many sources--policies, plans, specifications for new equipment,
information concerning on-the-job performance of earlier graduates, information about
on-the-job requirements, and so forth.

The management element assembles all such information and decides on terminal
training objectives. In order for the terminal objectives to be most useful, they should be
in the form of detailed specifications.

With terminal objectives defined, the training operations element is responsible for
developing detailed training objectives and for providing graduates who can perform as
defined by management. The set of terminal objectives forms a complete inventory for
evaluation. The training objectives also include information about the conditions under
which tasks are expected to be performed and thereby define test conditions. Further,
the training objectives also include the standards or tolerances for the tasks in terms of
accuracy and speed requirements; these are also tolerances for use in scoring an
individual's performance on a task.

In order to assess how effectively the training system is performing, another
kind of information is needed about each task: the minimum acceptable percentage of
students capable of performing within tolerances. Cost and time aside, it would be
desirable for every student to be able to perform every task within the defined
tolerances. However, achieving such a goal would be likely to make the cost and time for
training intolerably large. Something short of 100% of the students capable of 100% of
the tasks must be defined as an acceptable standard of effectiveness of the training
system.

The standard must, however, take account of the varying criticality of the tasks.
Ninety percent of electricians being 90% correct in the procedures for grounding an
electrical circuit during repairs is not an acceptable standard. Fifty percent knowing the
correct nomenclature of 50% of the contents of their tool kits may be acceptable on a
particular job. The criticality measure for any task is basically an assessment of the effect
on the operational system of the incorrect performance on that task. In assisting in the
development of a training program for stock clerks, we found that the system could
absorb, with minor turbulence, an error in the nomenclature of an ordered item but that
the stock number was highly critical: a misplaced digit could produce an avalanche of
toilet paper instead of a fork-lift truck. Similarly, the delivery address was of medium
criticality, producing serious delay in delivery; but a misreading of the unit of issue, and
we have an avalanche of toilet paper.

The second element, tests and measures, does not by itself make a quality control
system, yet it is clearly an essential element of any such system in order to provide the data
base on which the system rests. In quality control we are particularly concerned with the
diagnostic capability of our testing procedures. We must be able to pinpoint the strengths
and weaknesses of the training for each detailed objective as a basis for decision and
action to improve or modify the training. In the light of Dr. Cogan's comprehensive
discussion of tests and measures, further discussion of this topic seems unnecessary.

It should be re-emphasized, however, that quality control requires absolute rather
than relative criteria. Scores and grades must reflect how many of the course objectives have
been mastered rather than how a student compares with other students. Further, we must
ensure that we are not wasting our training time and the potential of our trainees by
failing them for the wrong reason. The key is job-relevance of both training and testing.
If the job requirement is to replace the bad part in a TV set on the basis of observation
of symptoms, the ability to quote and manipulate Ohm's Law is not job-relevant. Our
carefully controlled studies document the fact that many potentially excellent electronics
repairmen in a number of training programs have been discarded because of irrelevant
weaknesses in physics and mathematics.

The test scores in and of themselves carry little meaning. As a third element, test
data must be analyzed and interpreted before they can yield meaningful inputs to
decision processes. The data reduction generally involves three kinds of considerations:
central tendency, variability, and stability. The central tendency is calculated to show the
overall performance of the group: the average or mean or, perhaps more useful, the percentage
of a class able to perform each specific task at or above the minimum standards. The
variability or spread is generally characterized by calculating the standard deviation, while
stability is identified by the standard error in order to distinguish the accidental or
incidental deviations from those that have a "real" basis.
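
The three kinds of data reduction named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reduce_scores`, the score scale, and the sample figures are hypothetical and do not come from the paper.

```python
import math
import statistics

def reduce_scores(scores, minimum):
    """Reduce one task's raw scores to the three summary measures.

    scores  -- one numeric score per trainee (hypothetical scale)
    minimum -- the minimum standard for the task, on the same scale
    """
    n = len(scores)
    mean = statistics.mean(scores)            # central tendency
    sd = statistics.stdev(scores)             # variability (sample SD, needs n >= 2)
    se = sd / math.sqrt(n)                    # stability: standard error of the mean
    at_or_above = sum(1 for s in scores if s >= minimum)
    return {
        "mean": mean,
        "percent_at_standard": 100.0 * at_or_above / n,
        "standard_deviation": sd,
        "standard_error": se,
    }

# Hypothetical class of four trainees scored on one task.
summary = reduce_scores([70, 80, 90, 100], minimum=80)
```

A deviation of the class mean from the standard that is small relative to the standard error is the kind the paper calls accidental or incidental.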

In the analysis of the data that have been reduced to measures of central tendency,
variability, and stability, three basic questions arise regarding performance on each task.
First, how does the central tendency compare with the standard? Has the class performed
above, below, or at the standard? Second, does the class performance fall within
tolerances established for the standard? Third, how critical or important is the task to
operational performance? As indicated earlier, the criticality of the task has direct
implications for the urgency of corrective action. The criticality dimension is built into
the analysis by differential standards and tolerances for specific tasks.

The collection, reduction, and analysis of the test and performance data are
necessarily designed to support a program of corrective actions, the fourth essential
element of the quality control system. It is, unfortunately, almost commonplace to find
massive collections of training data, created at considerable effort and expense, lying idle.
Too often such data are assembled without a specific plan for utilization or in the
absence of specific procedures for implementing the existing plan. Prior to the collection
and analysis of the data, there must be procedures for corrective action, that is, specification
of the process by which decisions are made and means of assigning responsibility for
implementing the actions selected. These procedures should be designed to identify
problems and to assign priority to their solution. The highest priority for action is for
those cases where the data analysis shows that performance is seriously out of the
tolerance range.

In order to maintain confidence and support of management and of the operating
elements, it is important that such problems be identified by the training element and
corrective action initiated immediately. The system should act rather than react to
external complaints. A complete action program should include procedures for:

(1) Identifying points and places where something seems to be seriously out of
tolerance and immediate action is indicated.

(2) Identifying points and places that are "suspicious," and that warrant
investigation as time and resources permit.

(3) Establishing a normal routine work load for continuing study of the
training program when everything is going well.
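
One way to picture the three-way sorting described above is the following Python sketch. The decision rule, the function name `triage`, and every figure in the example are assumptions made for illustration; the paper does not prescribe particular thresholds.

```python
def triage(percent_passing, standard, tolerance):
    """Sort one task's class result into one of three action categories.

    All arguments are percentages of the class (0-100). The rule is a
    hypothetical one: below the tolerance band means immediate action,
    inside the band but below standard means "suspicious".
    """
    if percent_passing < standard - tolerance:
        return "immediate action"
    if percent_passing < standard:
        return "investigate as resources permit"
    return "routine study"

# Hypothetical figures (percent passing, standard, tolerance). The critical
# grounding task carries a stricter standard and a narrower tolerance than
# the nomenclature task, reflecting differential criticality.
tasks = {
    "ground circuit before repair": (92.0, 100.0, 2.0),
    "name tool kit contents": (55.0, 50.0, 10.0),
    "read stock number": (96.0, 98.0, 5.0),
}
report = {name: triage(*figures) for name, figures in tasks.items()}
```

Because criticality enters only through the per-task standard and tolerance, the same rule can serve every task.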

Obviously a quality control system must include carefully designed communication
procedures. The information generated by the system must be differently packaged for
transmission to the responsible individuals on an appropriate schedule so that the
necessary decisions can be made on a timely basis. Equally important are provisions for
flow of relevant information into the system: changes in operating procedures, new
equipment, modifications in personnel selection procedures, policy decisions affecting
training, and so on.

Proper communication is vital to maintaining managerial support, which is both a
cause and an effect of a dynamic quality control system. The quality control system
cannot operate effectively without strong support from all managerial levels, nor will this
support continue unless the system operates effectively. Support from management is
especially needed, because the data produced by the quality control element may be
unpleasant. However, if the information is directed toward corrective action, quality
control can be viewed as the shared mission of management and the training element:
producing the tangible asset of a well-trained addition to the company work force.

DEVELOPING PERFORMANCE TESTS FOR TRAINING EVALUATION



William C. Osborn 



A performance test is a template: a template modeled from a job task and used to
gauge the similarity of a trained behavior to the demands of that job task. This view of
performance tests implies a straightforward approach to their development. One simply
re-creates the circumstances of the job task, asks the trainee to perform the task, and
then records that he did or did not do it. Unfortunately, from our own experience we
know that it is not this simple. Many practical problems intervene to complicate the
process. We often find that a job has so many tasks that days would be needed to test
them all. Occasionally, the equipment, terrain, and other support requirements prevent a
realistic test for even a single task. At other times, we run into standards of task
performance that are difficult to translate into a pass-fail criterion for scoring. We also
have found that trainers need more than pass-fail results; they need diagnostic information
to tell them why their trainees failed, if they did.

These are some of the major problems encountered by test developers, though by no
means all. For the most part, the kinds of test development problems that we encounter
in the field of training evaluation are not the same as those encountered in the field of
aptitude testing. Thus, we have found the traditional body of academic literature on test
development to be poorly suited to our needs. Certainly the basic notions of reliability
and validity apply to any test development effort, but in our field, the exotic, sophisticated
formulas that fill most books on test development are of little use.

One vital need in the field of training evaluation, it seems to me, is a how-to-do-it
manual for test developers, one that responds to the variety of practical constraints and
problems that occur in the process of constructing tests for the myriad tasks spanned by
some eight or nine hundred Army jobs.

I wish that I had such a manual for you, but I don't. What I do have is intended to
be a step, albeit small, in that direction. I have attempted to chart the major action
points in the course of developing a test for training evaluation. These steps in performance
test development are shown in Figure 1, and I hope that you will find it a useful
framework for discussing the problems and practices of test development.

There are two matters of terminology that need clarification. The first has to do
with the concept of performance testing. I choose to use this concept (at least today) to
designate the test or tests normally developed and administered by a quality control
agency on completion of training for the two explicit purposes of qualifying trainees and
evaluating training. This type of testing is to be distinguished from the development and
use of tests by trainers for monitoring student progress within and between stages of
training. The second is that I use the term test item in referring to the evaluation of
behavior involved in a single job task, and the term test in referring to the aggregate of
these items over an entire job or job sector purportedly covered by the training program.
I am not asking you to agree with these labels, but to bear them in mind for the
moment.

Now let us return to the process of test development as outlined in the figure. I
should like to proceed through the 14 steps, and give a brief summary of my thoughts on
the "why, what, and how" of each one.

The first three steps on the chart concern assembling information that should
routinely be supplied to the test developer. He should only have to verify completeness
of the information, and not make judgments about its accuracy. As stated in the first
step, test development begins with the objectives for the job or job sector for which
people are to be trained. These are sometimes termed job objectives; more often,
terminal training objectives. Whatever they are called, they are the master list of
specifications derived from the job, and from which both training developers and
performance test developers, separately, begin their work. As test developers, our goal is
to develop a performance test item for each and every objective, although this is not to
imply that our final test will necessarily encompass all objectives. In addition, each
objective should be accompanied by a supporting list of skill and knowledge requirements
to be used in later stages of test development.

The information designated in Step 2 should also be available as a matter of course.
The relative importance of each objective, as judged in terms of mission capability,
represents data that are necessary in making trade-offs later in the test development
process.

Step 3 suggests that each objective must be reviewed to make sure it is all there. We
know that, in addition to a stated task behavior, an objective should contain stated
conditions and standards of performance. If any of the three elements is missing, or if
any are unclear to the test developer, he should get together with the task analyst and, as
indicated in Step 4, obtain a clear statement of the missing or confusing elements.
Performance standards are the most common source of trouble, and if a fair and
meaningful pass-fail criterion is to be established for a test item, the developer must have
an unequivocal standard of task performance to work from.

In Step 5, test item development really begins. Here, the developer must judge the
feasibility of duplicating in a test situation the conditions and behavior called for in the
objective. Normally, of course, our view is that well-stated objectives are blueprints for
testing, in fact dictating what the test conditions will be. Occasionally, however, we
encounter an objective calling for the use of job-relevant equipment, terrain, support
personnel, or a time frame that exceeds the resources available to the test agency. In
these instances, the developer must carefully weigh the criticality of the objective (from
Step 2) against the cost factors before deciding that full realism cannot be afforded,
because invariably some degree of relevance is lost as one departs from the test specifications
given in the objective.

When it is decided that the conditions of the objective cannot be duplicated in the
test situation, a substitute technique must be developed, as indicated in Step 6. This is
perhaps the most subtle and challenging aspect of the development process. Here, a
developer's inventiveness is often needed in devising a method and conditions for testing
that will call for the demonstration of a behavior that is as similar as possible to the
behavior stated in the objective. Too often in this situation developers resort to
paper-and-pencil tests measuring knowledge of the task, an approach that in most cases can be
safely rejected out of hand. In considering simulation options, developers have a useful
check available in the task's skill and knowledge requirements. The relevance of a
proposed test method may be evaluated by checking the number of skill and knowledge
components of the task that are called for in the method.
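
The relevance check just described amounts to counting how many of a task's skill and knowledge components a proposed method actually calls for. A minimal Python sketch follows; the function name `relevance` and the component lists are hypothetical, chosen only to illustrate the count.

```python
def relevance(required_components, method_components):
    """Return (fraction of required components exercised, components missed)."""
    required = set(required_components)
    if not required:
        raise ValueError("the task needs at least one skill or knowledge component")
    covered = required & set(method_components)
    return len(covered) / len(required), sorted(required - covered)

# Hypothetical component list for a repair task.
required = [
    "identify fault symptom",
    "select replacement part",
    "remove bad part",
    "install new part",
]

# A paper-and-pencil method exercises only the knowledge components ...
paper_score, paper_missed = relevance(
    required, ["identify fault symptom", "select replacement part"]
)
# ... while a bench simulation (also hypothetical) exercises all four.
bench_score, bench_missed = relevance(required, required)
```

A simple count treats every component as equally important; a developer might reasonably weight the components instead.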

Once a task-relevant method of testing is determined, in Step 5 or Step 6, the
developer turns his attention to the matter of achieving measurement reliability. In
Step 7, he must again look at the objective in terms of repetitions or variations of the
behavior implied. In most cases this will be explicitly given. For a specific skill, such as
disassembling a rifle or installing a carburetor, a single demonstration of the behavior is
all that is normally called for. On occasion, however, with generalized skills or generalized
behaviors, the number of repetitions of the behavior may or may not be clearly stated in
the objective. An objective specifying that something will be done correctly 9 out of 10
times creates no problem for the test item developer, as 10 repetitions are required. On
the other hand, the standard may be phrased in terms of correct performance on 90% of
the trials. Here a decision must be reached on an appropriate number of repetitions of
the performance to ask for in the test item. More generally, the important consideration
in Step 7 is whether a large enough sample of trainee performance is being required so
that success or failure does not result largely from chance. Here, again, the test developer
must make some trade-off between time or cost factors and reliability of the measured
behavior.
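
The role of chance in Step 7 can be made concrete with a short calculation. The Python sketch below assumes, as a simplification not stated in the paper, that a trainee's trials are independent and share one fixed probability of success; the function name `pass_probability` and the figures are illustrative only.

```python
from math import comb

def pass_probability(trials, required, true_rate):
    """Probability of at least `required` successes in `trials` attempts.

    Assumes independent trials with a constant success probability
    `true_rate` (a binomial model, adopted here only for illustration).
    """
    return sum(
        comb(trials, k) * true_rate**k * (1 - true_rate) ** (trials - k)
        for k in range(required, trials + 1)
    )

# A hypothetical trainee who succeeds on only 70% of attempts,
# tested against a "90% correct" standard at two test lengths.
short_test = pass_probability(trials=10, required=9, true_rate=0.7)
long_test = pass_probability(trials=20, required=18, true_rate=0.7)
```

Under this model the substandard trainee passes the 10-trial item about 15% of the time and the 20-trial item about 4% of the time, which is the trade-off between testing time and reliability that the developer must weigh.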

Step 8 pertains to another aspect of test reliability: the standardization of the
conditions under which a test item is administered. Here, the important factors are the
instructions and environmental conditions under which the test item is given. Instructions
should be identical for everyone. They should be clearly and simply stated, leaving
nothing to the interpretation or misinterpretation of the trainees taking the test. Things
such as the method of scoring and whether speed or accuracy is important should be
stressed in the instructions. Also, conditions pertaining to test supplies and environmental
factors should be constant for all personnel. Items of equipment worked with or on
during testing should be restored to their pretest condition if they are used by successive
trainees. Similarly, environmental factors such as visibility, temperature, attitude of the
tester, time of day, and the like, must be stabilized.

In Step 9, a final aspect of measurement reliability is considered. Here procedures
for translating an observed trainee performance into a pass-fail score must be developed.
Provision for this type of scoring should be structured so that only the more reliable
human skills are used. That is, the scoring activity should be reduced to one of matching
or comparing the test item response with some model of the acceptable response. If the
model response on a test of rifle marksmanship is defined as a hole in the bullseye, then
the scorer has a relatively easy task in judging the acceptability of the response made by
the rifleman. Unfortunately, responses for many test items cannot be judged in this
"either/or" fashion, but require a "more-or-less" type of judgment. In these cases, the
developer should always strive to break down the model response into elements so that
comparative judgments can be made more easily by the scorer. This may often entail
preparing a checklist of the necessary components or features of the model response.
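
Checklist scoring of this kind reduces the scorer's job to a series of yes-or-no matches against the model response. The following Python sketch is one possible form; the function name `score_item`, the all-elements-required rule, and the checklist entries are assumptions for illustration.

```python
def score_item(checklist, observed):
    """Score one test item by matching against the model response.

    checklist -- the necessary elements of the model response
    observed  -- the elements the scorer marked as present
    Returns ("pass" or "fail", list of missing elements). The rule that
    every element is required is an assumption of this sketch.
    """
    missing = [element for element in checklist if element not in observed]
    return ("pass" if not missing else "fail"), missing

# Hypothetical checklist for a disassembly task.
checklist = ["weapon cleared", "parts removed in order", "parts laid out"]
result, missing = score_item(checklist, {"weapon cleared", "parts laid out"})
```

Each element is itself an "either/or" judgment, so the "more-or-less" decision is never left to the scorer as a whole.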

In Step 10, a supplementary scoring procedure is developed for use in diagnosing
reasons for trainee failure on the test item. Pass-fail scoring is sufficient in meeting the
primary mission of quality control, which is the certification of trainee job readiness.
However, the secondary mission, that of training program evaluation, is best accomplished
by providing the trainers not only with the incidence of pass and failure for an objective,
but also feedback on why trainees failed. One way to obtain these data is through a
checklist developed from the skill and knowledge requirements of the task to be used by
the tester in recording why the trainee failed a test item. When accumulated over a
number of test item administrations, this diagnostic information will normally provide a
stable picture of the reasons for failure that trainers may then use to selectively revise
and strengthen their program.
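
Accumulating the diagnostic checklists over many administrations is a simple tallying exercise. The Python sketch below shows one way to do it; the function name `failure_profile` and the recorded reasons are hypothetical.

```python
from collections import Counter

def failure_profile(records):
    """Tally reasons for failure across failed administrations of one item.

    records -- one list of unmet skill/knowledge requirements per failure
    Returns (reason, count, percent of failures) tuples, most common first.
    """
    counts = Counter(reason for reasons in records for reason in reasons)
    total = len(records)
    return [
        (reason, count, 100.0 * count / total)
        for reason, count in counts.most_common()
    ]

# Hypothetical tester records for three failed administrations.
records = [
    ["misread stock number", "wrong unit of issue"],
    ["misread stock number"],
    ["wrong delivery address", "misread stock number"],
]
profile = failure_profile(records)
```

The reason at the head of the profile is where trainers would look first when revising the program.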

In Step 11, the test developer simply brings together the products of previous steps
and formats the final test item. Detailed instructions to the tester covering test materials,
equipment, procedures, precautions, and so forth, are spelled out. The directions to be
read to the trainee by the tester, and the scoring procedure, should also be written out.

The final three steps in the figure pertain to assembly and administration of the
final form of the test. In Step 12, a decision is made on whether time permits testing on
all objectives, that is, administration of all test items. If it is not feasible to do so, an
appropriate sample of test items has to be selected (Step 13). As indicated in this step,
the main criterion for sampling should derive from criticality ratings of the objectives. An
exact procedure for doing this will depend upon the categories originally used for
reporting criticality. Generally, the developer would first include all "essential" or highly
critical items, and then sample from the remaining. Wherever sampling is necessary, the
usual practice is to vary the sample from one administration to the next so that all test
items are used sooner or later. Variations in the sample should not be systematic in the
sense that trainers or trainees can anticipate what items are going to appear.
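
The sampling rule of Step 13 (include every essential item, then draw unpredictably from the rest) can be sketched as follows. The function name `select_items`, the two-level criticality labels, and the item names are assumptions for illustration; an actual procedure would depend on the criticality categories in use.

```python
import random

def select_items(items, capacity, rng=None):
    """Choose test items for one administration.

    items    -- mapping of item name to criticality label
    capacity -- number of items that time permits
    rng      -- optional random.Random, so a draw can be reproduced
    All "essential" items are always included; the remaining slots are
    filled by a random draw, so the sample varies without a pattern.
    """
    rng = rng or random.Random()
    essential = [name for name, crit in items.items() if crit == "essential"]
    remaining = [name for name, crit in items.items() if crit != "essential"]
    slots = max(0, capacity - len(essential))
    return essential + rng.sample(remaining, min(slots, len(remaining)))

# Hypothetical item pool for a stock clerk test.
pool = {
    "read stock number": "essential",
    "read unit of issue": "essential",
    "record delivery address": "other",
    "record nomenclature": "other",
    "file requisition copy": "other",
}
chosen = select_items(pool, capacity=4, rng=random.Random(1975))
```

Purely random draws do not by themselves guarantee that every item is used sooner or later; a developer who wants that guarantee might track which items have not yet appeared.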

In Step 14, final guidance for test administration is prepared. Training for testers
may have to be developed; lists of equipment and materials prepared; and scheduling
worked out. If testing is to be done individually, it is usually a good idea to prescribe a
"county fair" layout of test stations. This serves purposes of economy, as well as
permitting test items to be administered in varying order. In addition, security precautions
must be specified to ensure, for example, that one trainee cannot benefit by
observing another's performance, or that trainees do not talk among themselves during
test administration.

Consideration of these action points, step by step, constitutes a framework for
performance test development.

GUIDELINES FOR THE EVALUATION OF APPLIED
PERFORMANCE TEST MATERIALS AND PROCEDURES



General guidelines are proposed for the evaluation of Applied Performance
Testing situations. Since Applied Performance Testing is conducted
within a wide range of situations, these guidelines should be
applied judiciously to individual instances. There may well be times
that very good Applied Performance Tests do not conform to some of the
guidelines which are included. In general, however, good Applied Performance
Tests are expected to demonstrate the qualities represented by the
proposed guidelines.

Definition: "Applied Performance Testing, for purposes
of the Clearinghouse for Applied Performance
Testing (CAPT) project, is defined as the measurement
of performance of some task significant to a
student's life outside the school and/or to adult
life. Such a task is valued as output for public
schools. The testing device must allow for measurement
of the task in an actual or simulated performance
setting."

A SUMMARY OUTLINE OF GUIDELINES FOR THE EVALUATION OF
APPLIED PERFORMANCE TEST MATERIALS AND PROCEDURES

1.0 Test Background
    1.1 Purpose of the Test
    1.2 Test Content
    1.3 Task or Job Analysis
    1.4 Pilot Testing and Validation

2.0 Characteristics of a Good Test
    2.1 Validity
    2.2 Reliability
    2.3 Adequacy
    2.4 Objectivity and Standardization
    2.5 Comparability
    2.6 Efficiency/Practicality
    2.7 Balance
    2.8 Difficulty/Discrimination
    2.9 Fairness
    2.10 Speededness
    2.11 Format
    2.12 Relevance

3.0 Test Administration and Reporting
    3.1 Instructions to the Examiner
    3.2 Instructions to the Examinee
    3.3 Scoring

1.0 Test Background

1.1 Purpose of the Test

1.1.1 The purpose of the test should be explicitly stated
to aid understanding by examinees, users of test
results, and those administering and interpreting
the test.

1.1.2 The construction of the test should reflect the purpose
for which the test is to be used.

1.1.2.1 Selection tests need to discriminate well
around cut-off points but would require
only limited discriminating power at the
ends of the distribution. Standards for
selection need to be well defined.

1.1.2.2 Certification tests need only discriminate
between individuals who have, and those who
do not have, certain well defined competencies.

1.1.2.3 Applied performance tests of a diagnostic
nature need to cover a limited scope but
in much greater detail. Such tests should
be designed to yield scores on separate
parts. The range of item difficulty and
individual discriminating power is less
important.

1.1.2.4 Applied performance tests for classification
of performers need to have a sufficient
range of item difficulty and individual
discriminating power to differentiate
individuals on a continuum of expected
competencies. Such tests are expected to
yield a single score and are expected to
be more general in nature than tests for
comprehensive testing and description of
competency levels of specific individuals.

1.2 Test Content

1.2.1 Tests should measure the performance of some task
thought to be significant to a student's adult life
or life outside of school. (Examples of tasks: (1)
read and comprehend the front page of a newspaper,
(2) make change (money), (3) read and follow directions
on a medicine bottle, (4) complete an application
for a job.)

1.2.2 Tests should provide for the measurement of the task
in an actual or simulated performance or job setting.

1.2.3 Tests should measure useful abilities of a practical
nature that contribute to success in life or success
in some aspect of the world's work.

1.2.4 A paper-and-pencil test can be considered the most
appropriate applied performance test when the test
response is identical with the behavior about which
information is desired. For example, a test in accounting
or shorthand would have to use the
paper-and-pencil format.

1.2.5 Test content should provide reasonable items which
sample/depict those behaviors in extracurricular
or adult life activities that are consistent with
the social and cultural contexts in which the activities
occur.

1.3 Task or Job Analysis

1.3.1 A task or job analysis should be referenced when developing
applied performance tests for testing in
complex job situations. A job analysis can include
information about job training, responsibility, job
knowledge, dexterity and accuracy, and equipment,
materials, and supplies. Also, information is needed
on examples of situational factors that are commonly
associated with and which may affect task performance.
For example, the condition of a patient
involved is a highly significant variable in describing
a task.

1.3.2 The test should show evidence of the representativeness
and criticality of tasks and sub-tasks to be measured.
Sub-tasks should be capable of impact on overall
task fulfillment.

1.3.3 If all the elements of a job are not measurable because
of constraints on time or resources, the sample
performance elements to be observed should be
identifiable as the most critical or crucial aspects
of a job. Critical representativeness of the performance
elements to be sampled should be apparent.

1.3.4 There should be information on the relatedness of
what is being measured to the kind of information
needed, such as accuracy in job performance, speed in
task completion, and dexterity in the use of tools.

1.3.5 Should the "state-of-the-art" deter measurement of
the most critical elements of a job, some references
should be available to the inherent problems of
measurement in relation to the critical elements.

1.3.6 When it is difficult to identify actions or behaviors
that constitute successful performances, there should
be information on the relatedness of such behaviors
to profiles of persons who are considered competent
or skilled. Profiles can be represented in the form
of task performance checklists.

1.4 Pilot Testing and Validation

1.4.1 Evidence should exist to show that the measure has
been pilot tested. Particular attention should be
paid to, among others, the following criteria:

(a) Directions are clear and unambiguous.

(b) Rating/scoring procedures are feasible,
accurate and objective.

(c) Time limits, if any, are reasonable and
consistent with the objectives.

1.4.2 Pilot testing should be conducted to identify the
test items which discriminate well between persons
who are competent at the task and those who are not.

1.4.3 In developing an occupational competency test, there
should be evidence that experts in the field scored
perfect or near perfect scores on the pilot test in
terms of product.

1.4.4 As evidence of specificity of measures to be obtained
a near chance score or better.

1.4.5 Tests should be independently reviewed by (1) employers
and practitioners who will eventually judge the
competence of a performer and (2) panels of reviewers
who should be carefully chosen and their qualifications
fully documented. Tests should be reviewed
for relevance, clarity, feasibility, and appropriateness
of purpose.

1.4.6 When testing non-verbal skills, pilot testing should
adequately demonstrate that students with limited
verbal skills or English-speaking competencies can
fully understand what is expected of them.






2.0 Characteristics of a Good Test

2.1 Validity: How well does the test measure what it purports
to measure?

2.1.1 Content Validity: Does the test require a demonstration
of competencies representative of the knowledge
and skills required for the task or activity being
measured?

2.1.2 Concurrent Validity: Is there evidence of substantial
correlation between the test, especially one
involving simulation or one shortened to include
only selected tasks, and a reliable and valid independent
criterion of performance?

2.1.3 Predictive Validity: If the test is used for prediction
or selection, is there subsequent evidence
that the test served as a good predictor of competency?

2.1.4 Claims of validity should be appropriately documented.
Such claims include, for example, correlations between
measures of performance on test items representing
a domain of well-defined test situations and
one or more measures of performance "on the job" or
in the "real" world.

2.1.5 Whether the actual test situation should differ from 
.the exact 'situation in which the skills would be ap- 
plied depends on the nature of the task. There are 
instances in which* the domain of task conditions is 
so large that training must focus on a general izable 
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principle rather than on teaching a response to 
every possible set of conditionsi. It is assumed 
that the student should be prepared for all possible 
test situations and should not be totally surprised 
by whaf he/she encounters on. a test. 

2.1.6 When direct application of skills or competence is 
not observable/m^^asurable, testing may have to be 
limited to provisions of indirect, inferential evi- ^ 
dence of proficiency, such as possession of the es- 
sential aptitudes and prerequisite -skills. 

2.1.7 The task performance to be observed should be so highly
structured that variation in results can be attributed to
different levels of competency in students' performance of a
task and not to extraneous factors related to the measurement
instrument or technique.

2.1.8 When testing in a simulated setting, the simulation
should be close enough to reality to be satisfactory for a
given purpose.

2.2 Reliability: How well does the test measure what it is
intended to measure? Are test scores consistent and
dependable? Have those sources of variation which are
attributable to chance been eliminated or controlled as far
as practical?

2.2.1 Documentation of reliability coefficients (parallel
form, test-retest, split-half, or internal consistency), when
used appropriately, should provide for measures of
reliability based on the variance in the proportions or
frequencies of correct, incorrect, and not attempted
responses across equivalent sample test items. (Caution: the
split-half and internal consistency measures are not
appropriate in cases of performance tasks with sections that
do not involve the same skills or that include skills which
are discrete and not necessarily correlated.) If parallel
forms are to be used, care must be taken to check on the
effect of altering variables on the two similar test forms.
When it is practicable to have stability coefficients for
measures of an individual person's competencies, repeated
testing should be conducted.
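
As an illustration only, two of the coefficients named in 2.2.1
might be computed as sketched below. The function names, the
odd/even item split, and the sample scores are assumptions made
for this example and do not come from the guidelines; the
Spearman-Brown correction is the conventional adjustment for a
split-half correlation.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half(item_scores):
    """Split-half reliability with the Spearman-Brown correction.

    item_scores holds one list of 0/1 item scores per examinee.
    Items are split into odd and even positions (an assumption);
    per the caution in 2.2.1, this is only meaningful when both
    halves draw on the same skills.
    """
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)


# Hypothetical total scores for five examinees tested twice.
first_testing = [12, 15, 9, 18, 14]
second_testing = [13, 14, 10, 17, 15]
test_retest = pearson(first_testing, second_testing)

# Hypothetical 0/1 item scores for three examinees on four items.
items = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
split_half_r = split_half(items)
```

A stability (test-retest) coefficient is simply the correlation
between the two testings; the split-half figure estimates the
reliability of the full-length test from its two halves.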

2.2.2 When having more than one form of the test is prac- 
ticable, parallel form reliability is the preferred 
form of reliability. 

2.2.3 When scoring procedures require judgment, the
reliability of such procedures should be documented by
showing the degree to which several independent judges score
performance in a like manner.

2.3 Adequacy

2.3.1 The test should be of sufficient length and scope to
sample appropriately and faithfully the behavior it 
is designed to measure. 

2.3.2 Tests should provide, when feasible in terms of time
and resources, a sufficient number of varying trials to
verify the results as good measures rather than an accident
of chance.

2.3.3 When practicable, it is highly desirable to have
parallel forms of the test.

2.4 Objectivity and Standardization: The test should be
constructed in such a way as to control or eliminate the
influence of random factors, personal opinions, and
unreliable subjective judgments on the final results.

2.4.1 It should be possible to present the task in virtually
the same manner to each examinee.

2.4.2 For standardization of instructions to examinees, a
tape recording of all instructions should be consid- 
ered as well as an accompanying written text for the 
examinee. 

2.4.3 Any equipment which is used should be subject to
standardization regarding its technical features and
proposed use.

2.4.4 Unintentional clues which are included in the
instructions or in other test items should be meticulously
checked for and eliminated.

2.4.5 When a performance instrument has been tested for
reliability in a specific testing situation, one must be
careful about using that instrument in other similar
situations in which testing conditions differ even slightly,
for even minor variations can place test results in question.
For example, a test designed to measure ability to drive in
metropolitan areas may not be appropriate for measurement of
rural driving ability. The driving skills are basically
unchanged; only the situation has been varied.

2.4.6 Acceptable procedures for controlling objectivity are:
(a) percentage of agreement among independent observers, (b)
correlation among independent observers, and (c) Guilford's
analysis of variance among raters and ratees.
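
Procedures (a) and (b) in 2.4.6 can be sketched as follows, as
a minimal example under stated assumptions. The function names
and the ratings for two observers are hypothetical; procedure
(c), the analysis of variance among raters and ratees, is not
attempted here.

```python
from statistics import mean


def percent_agreement(ratings_a, ratings_b):
    """Percentage of examinees on whom two observers gave the same rating."""
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return 100.0 * matches / len(ratings_a)


def pearson(x, y):
    """Pearson correlation between two observers' numeric ratings."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical pass/fail judgments by two independent observers.
observer_a = ["pass", "pass", "fail", "pass", "fail"]
observer_b = ["pass", "fail", "fail", "pass", "fail"]
agreement = percent_agreement(observer_a, observer_b)  # 4 of 5 match

# Hypothetical 1-6 scale ratings: observer B is one point more
# lenient than observer A on every examinee.
scale_a = [1, 2, 3, 4, 5]
scale_b = [2, 3, 4, 5, 6]
scale_r = pearson(scale_a, scale_b)
scale_agreement = percent_agreement(scale_a, scale_b)
```

The second data set shows why the two procedures are worth
reporting together: a constant difference in leniency leaves
the correlation perfect while the observers never assign the
same rating.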

2.5 Comparability 

2.5.1 The results of the individual examinee's test, as
necessary, should be subject to meaningful and ob- 
jective interpretation through comparison with other 
test scores obtained by the examinee, or comparison 
with predetermined criteria or other standards set 
for the examinee. 

2.5.2 If norms have been established for the test, explicit
and complete descriptive information about the norming
population and the procedures used should accompany the test.

2.5.3 Whether the test score interpretation is norm-refer- 
enced or domain-referenced should be determinable 
and consistent with the purpose and type of informa- 
tion desired. 

2.6 Efficiency/Practicality

2.6.1 The results of the test should be worth the amount 
of time, effort and money required of both examiner 
and examinee to obtain those results. 

2.6.2 Results should be critically reviewed to determine if a
less complex mode of testing could have obtained equally
useful information. For instance, would a paper-and-pencil
test have been just as good? Would a simulation test setting
have been just as appropriate as a real job setting?
Acceptability of the paper-and-pencil test or simulation test
setting should, however, be based on identifiable criteria
such as the cost of an error in a real life situation in
terms of time, people, and resources.

2.6.3 Tests which can be administered just as well on a group
basis should be administered in this manner. For the sake of
efficiency, some validity may have to be sacrificed. The
decision must be based on comparison of the relative
advantages of each approach.

2.6.4 Supplies, tools, and equipment to be used should be
held to a minimum, but should be adequate to ensure a
realistic measurement of the task.

2.6.5 When "process" (or work procedure) is not both
necessary and sufficient to completing the task "product"
(outcome), the test should include provisions for measuring
the task product.

2.6.6 For tasks of long duration, in which many people are to
be tested, a sampling scheme of people at various points in
time should increase efficiency when only group data are
needed.

2.6.7 The performance to be observed should involve as lit- 
tle repetition of identical procedures as possible in 
any one testing. A single item, if well constructed, 
can be highly reliable. 

2.6.8 Some correlation should be evident between
developmental costs and projected relevance, accuracy, and
use of test items over time. Would the items be of use long
enough to justify developmental costs? Would the items be
free of errors that could result in or contribute to
fatality in critical testing situations, for instance, in
the medical field?

2.6.9 When the testing situation involves human stress, any
simulation effort should be considered from a practical,
legal, and ethical standpoint.

2.7 Balance

2.7.1 The testing time and the importance or significance 
attributed to the results of each task should gener- 
ally correspond to the instructional importance or 
priority of the task itself.

2.8 Difficulty/Discrimination

2.8.1 The difficulty level of the task and its complexity
should be appropriate for the maturity of the examinees.

2.8.2 Tests which are used for classification and diagnosis
should include a sufficient number of "easy" items in
addition to "hard" items to permit meaningful analyses of
examinees' strengths and weaknesses.

2.8.3 Irrelevant sources of difficulty (e.g., inappropriate
reading level, vague directions, unclear illustrations, poor
quality of test materials) should be eliminated to the extent
possible.

2.8.4 Except in domain-referenced testing, tasks which do not
discriminate (e.g., "everybody can do them" or "nobody can do
them") should be reevaluated when considered for further use.
Items which do not discriminate add little to the quality of
a test except in domain-referenced testing.
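
One common way to flag the non-discriminating tasks described
in 2.8.4 is an upper-lower group comparison, sketched below.
This is an assumed technique chosen for illustration, not one
prescribed by the guidelines; the function name, the use of
the top and bottom halves of the examinees, and the sample
scores are all hypothetical.

```python
def item_stats(item_scores, fraction=0.5):
    """Return (difficulty, discrimination) for each item.

    item_scores holds one list of 0/1 item scores per examinee.
    Difficulty is the proportion of all examinees passing the item.
    Discrimination is the proportion passing in the top-scoring
    group minus the proportion passing in the bottom-scoring group,
    where each group is `fraction` of the examinees (an assumption).
    """
    n = len(item_scores)
    k = max(1, int(n * fraction))
    ranked = sorted(item_scores, key=sum, reverse=True)
    upper, lower = ranked[:k], ranked[-k:]
    stats = []
    for j in range(len(item_scores[0])):
        difficulty = sum(row[j] for row in item_scores) / n
        discrimination = (sum(row[j] for row in upper) / k
                          - sum(row[j] for row in lower) / k)
        stats.append((difficulty, discrimination))
    return stats


# Hypothetical scores: four examinees, three tasks.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]
stats = item_stats(scores)
```

In this made-up data the first task is passed by everyone and
has a discrimination of zero, which is the "everybody can do
them" case; it would be a candidate for reevaluation unless the
test is domain-referenced.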

2.8.5 A test designed to make very fine measurement
distinctions should contain multiple items of a sufficient
number to permit general conclusions.

2.9 Fairness

2.9.1 The verbal factor in tests (reading, writing, speaking)
should be minimized in the testing of specific performances
except in those cases which require oral/written/reading
communication skills.

2.9.2 Tests should be carefully checked for irrelevant sexual
bias or content bias against any ethnic, social or geographic
group. Such a check could be based on reviews and empirical
studies.

2.9.3 Tests should measure those skills or areas of knowledge
which are based upon instructional objectives considered
valid for the examinees involved.

2.10 Speededness

2.10.1 The time allowed examinees should be appropriate for 
the length of the test.

2.10.2 In those cases where the speed with which a person
works is not important, the test should allow enough time so
that all examinees have time to finish the test.

2.10.3 In performance test situations where speed is an
important indicator of competency, the number of items
completed in a set time or required length of time should be
set as a measure of successful performance. Examinees should
be told whether time is a factor in scoring.

2.11 Format

2.11.1 To help examinees develop positive attitudes toward
testing, and to sustain examinees' interest during the
testing process, motivational factors such as (a) novelty of
stimulus, (b) attractiveness of stimulus, and (c)
action-orientedness should be incorporated in the test
situation.

2.11.2 Procedures for testing in the affective domain should
be based on an unobtrusive and ethical test method.
Motivational or attitudinal tasks cannot with certainty be
validly tested by conventional means; tasks may be better
tested covertly through ongoing observation or structured
observation during performance of some task that is
ostensibly being tested.

2.11.3 When attitudes must be observed and measured in a
contrived obtrusive setting, the stimulus should include
provisions for helping examinees respond meaningfully and
realistically. For example, in measuring value judgments, an
examiner might show examinees a short film of an emotional
situation and ask examinees to evaluate the situation; in the
process of evaluation, examinees' value judgments would be
more meaningfully elicited than through traditional
paper-and-pencil inventories.

2.11.4 Generally speaking, if a task's natural sequence is
not critically disturbed, it is desirable to have test items
or tasks progress from simple to difficult.

2.12 Relevance
2.12.1 Tests in use over time should be periodically re- 
evaluated whenever instructional objectives or per- 
formance requirements are changed to any considerable 




should reflect current, updated.
Test Administration and Reporting

3.1 Instructions to the Examiner

3.1.1 The procedures to be followed by the examiner should be
clearly specified.

3.1.2 All instructions to examiners should be as simple as
possible.

3.1.3 The equipment, facilities, or other materials to be
used should be clearly specified for the examiners.

3.1.4 Detailed guidance should be given the examiner as to
the type and limits of assistance (oral or other) that may be
given examinees.

3.1.5 Detailed guidance should be given the examiner covering
the physical layout and the management of facilities, and the
testing time necessary to ensure that examinees are tested
fairly, efficiently and without jeopardizing test integrity.
(This is most important in conducting large-scale, concurrent
testing of individuals at multiple test stations.)

3.1.6 Any potential hazards or safety precautions to be 
taken should be pointed out to examiners. 

3.1.7 Equipment and materials used by successive examiners 
should be restored to pre-test condition for each
student. 

3.1.8 Test users should be advised in understandable terms of
the limits and constraints, applicability, and
interpretations of test results.

3.2 Instructions to the Examinees 

3.2.1 The purpose of the test should be explained.

3.2.2 Time limits, if any, should be explained.

3.2.3 The equipment, facilities and other materials which 
are available should be specified for examinees.

3.2.4 Any safety precautions or potential hazards should
be noted for the examinees. 

3.2.5 The process of answering items or demonstrating com- 
petencies as well as the method of scoring should be 
carefully prescribed.

3.2.6 Examinees should understand how much freedom they
have in demonstrating competency and whether they 
are subject to penalty for guessing. 

3.2.7 All instructions to examinees should be as simple as
possible. 

3.2.8 A procedure should be included to ensure that the
examinees know what they are expected to do. Re- 
sponding to a sample question is an example of such 
a procedure. 

3.2.9 When it is expected that the test format is new to the
examinees, they should be given some practice, in advance,
using that format.

3.3 Scoring

3.3.1 Scoring procedures should be standardized and ob- 
jective. 

3.3.2 When completeness of performance is to be observed,
performance at each checkpoint should be scored as "passed"
or "failed"; each test item should be unambiguously scorable
as either correct, incorrect, or not attempted.

3.3.3 When rating scales are used, rating categories should
be carefully defined with specific examples given as a
standard of comparison for each category; scale points should
be sufficiently discriminating.

3.3.4 When possible, multiple judges who are well trained are
preferable to a single judge. There may be occasions when one
well-trained judge is preferred, if the quality of other
prospective judges' training is questionable.

3.3.5 Interjudge reliability should be established and 
documented with all scoring procedures. 

3.3.6 If there is more than one judge, each should make 
judgments independently, with subsequent negotiation 
to reach consensus on the rating to be assigned. 

3.3.7 It should be determined in advance whether the "pro- 
cess" or the "product" of the task will be more im- 
portant in scoring. (In many cases some combination 
of the two will determine the score.) 

3.3.8 Generally, both the "quality" of the work and the
performance "time" considered in scoring are dictated by the
task; thus, standards for scoring should be documented.

3.3.9 The scoring method for the test should be consistent
with the purpose of the test. For example, if the test is
being used to determine examinees' progress over time, can
the score information be appropriately used to show change in
performance over time? Can the score information be used for
trend analysis over time or with different groups?

3.3.10 Scoring keys and procedures should be pilot-tested 
and checked for feasibility, clarity and appropriate- 
ness. 

3.3.11 Maximum use should be made of scoring aids, such as
templates, to further the objectivity of scoring.

3.3.12 Detailed instructions on how to score the examinees 
and provisions for practice scoring trials should be 
provided. 

3.3.13 The number of tasks to be scored or rated should be 
sufficiently moderate that the rater(s) can score 
accurately and reliably.

3.3.14 Specific scoring guidelines, criteria and required
examinee qualifications for scoring on the basis of direct
observation should be specified.

3.3.15 The scoring of trivial tasks should be avoided. 

3.3.16 Whenever possible, scoring should be done without
examinee identification to minimize biases and
inconsistencies.

3.3.17 To the extent possible the scoring activity should be
reduced to one of comparing the test item response with some
model of the acceptable response. If a response cannot easily
be judged in a "yes/no" fashion, but requires a
"more-or-less" judgment, the model response should include
enough examples to permit reliable comparative judgments.



3.3.18 The feasibility of making audio or video recordings of
task performance should be considered, since this permits a
more accurate scoring procedure. This is particularly useful
when the task process is transient or does not result in a
product that can be examined at leisure by the examiner.